Yeah, sounds like a lot to dump if it takes 15 minutes to run. That alone can take a long time.
I once forgot to comment out a debug line in my UDF. When run with production data, not only was it slow, it blew up the cluster: it simply ran out of log space :)
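For a fairer comparison with the Java job, something like the sketch below might be worth a try. It is untested against your data and rests on a few assumptions: the output path '/logs-out/domain_counts' is made up, the `domains` alias is my own addition, and I am guessing the Java job writes its counts to HDFS without sorting them. It keeps your LOAD and GROUP, names the count column, and swaps DUMP for STORE so the rows are written to HDFS rather than printed to the console.

```pig
-- Sketch only: the output paths below are hypothetical, pick your own.
records = LOAD '/logs-in/*.log' USING PigStorage(',')
    AS (ts:long, server:int, user:int, action_id:int, domain:chararray, price:int);

-- Keep only the column the count needs before grouping.
domains = FOREACH records GENERATE domain;

grouped_by_domain = GROUP domains BY domain;
freq = FOREACH grouped_by_domain GENERATE group AS domain, COUNT(domains) AS mycount;

-- Write the counts to HDFS instead of dumping them to the console.
STORE freq INTO '/logs-out/domain_counts' USING PigStorage(',');

-- Only if the sorted output is really needed (this adds MapReduce work):
-- sorted = ORDER freq BY mycount DESC;
-- STORE sorted INTO '/logs-out/domain_counts_sorted' USING PigStorage(',');
```

If this version lands close to the 15 minutes of the Java job, the ORDER and DUMP steps were probably the difference, as Jonathan guessed.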
On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> A couple of possibilities that I'm kicking around off the top of my head...
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job compile all the results into one job?
> My guess is the Order+Dump are making it take longer.
> 2011/6/17 Sujee Maniyam <[EMAIL PROTECTED]>
>> I have log files like this:
>> #timestamp (ms), server, user, action, domain , x, y ,
>> 1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>> 1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>> 1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>> I have the following Pig script to count how often each domain appears in
>> the logs (for example, facebook.com was seen 10 times, etc.).
>> Here is the pig script:
>> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
>> server:int, user:int, action_id:int, domain:chararray, price:int);
>> -- DUMP records;
>> grouped_by_domain = GROUP records BY domain;
>> -- DUMP grouped_by_domain;
>> -- DESCRIBE grouped_by_domain;
>> freq = FOREACH grouped_by_domain GENERATE group as domain, COUNT(records)
>> AS mycount;
>> -- DESCRIBE freq;
>> -- DUMP freq;
>> sorted = ORDER freq BY mycount DESC;
>> DUMP sorted;
>> This script takes an hour to run. I also wrote a simple Java MR job to
>> count the domains; it takes about 15 minutes. So the Pig script is taking 4x
>> longer to complete.
>> Any suggestions on what I am doing wrong in Pig?