Pig, mail # user - pig script takes much longer than java MR job


Re: pig script takes much longer than java MR job
Dexin Wang 2011-06-18, 00:34
Yeah, that sounds like a lot to dump if it takes 15 minutes to run. The dump alone can take a long time.
 
I once forgot to comment out a debug line in my UDF. When run on production data, not only was it slow, it blew up the cluster: it simply ran out of log space :)

On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> A couple of possibilities that I'm kicking around off the top of my head...
>
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job compile all the results into one job?
>
> My guess is the Order+Dump are making it take longer.
>
> 2011/6/17 Sujee Maniyam <[EMAIL PROTECTED]>
>
>> I have log files like this:
>>  #timestamp (ms), server, user, action, domain, x, y, z
>>  1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>>  1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>>  1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>>
>> I have the following pig script to count how often each domain appears in
>> the logs (for example, we have seen facebook.com 10 times, etc.).
>>
>> Here is the pig script:
>>
>> --------------------------------
>> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
>> server:int, user:int, action_id:int, domain:chararray, price:int);
>>
>> -- DUMP records;
>> grouped_by_domain = GROUP records BY domain;
>> -- DUMP grouped_by_domain;
>> -- DESCRIBE grouped_by_domain;
>>
>> freq = FOREACH grouped_by_domain GENERATE group as domain,
>>     COUNT(records) as mycount;
>> -- DESCRIBE freq;
>> -- DUMP freq;
>>
>> sorted = ORDER freq BY mycount DESC;
>> DUMP sorted;
>> --------------------------------
>>
>> This script takes an hour to run. I also wrote a simple Java MR job to
>> count the domains; it takes about 15 minutes. So the pig script is taking
>> 4x longer to complete.
>>
>> any suggestions on what I am doing wrong in pig?
>>
>> thanks
>> Sujee
>> http://sujee.net
>>
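
A minimal sketch of the suggestions in this thread, not a measured fix: it drops the ORDER step (which, as noted above, kicks off another MR job), projects only the domain column before grouping, and writes the counts with STORE rather than DUMP. The output path '/logs-out/domain-counts' is a hypothetical name chosen for illustration; the input path and schema are copied from the original script. Whether this closes the gap with the Java job has not been tested here.

--------------------------------
-- Input path and schema are taken from the original script.
records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
    server:int, user:int, action_id:int, domain:chararray, price:int);

-- Keep only the column the count needs, so less data is carried
-- into the GROUP.
domains = FOREACH records GENERATE domain;

grouped_by_domain = GROUP domains BY domain;
freq = FOREACH grouped_by_domain GENERATE group as domain,
    COUNT(domains) as mycount;

-- Write results to HDFS instead of dumping them to the console.
-- The output path is an assumption, not part of the original thread.
STORE freq INTO '/logs-out/domain-counts' using PigStorage(',');
--------------------------------

If the sorted output is actually needed, the ORDER line from the original script can be put back before the STORE. That keeps the extra sort job but still avoids sending every row to the console.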