Pig >> mail # user >> pig script takes much longer than java MR job


Re: pig script takes much longer than java MR job
Yeah, that sounds like a lot to dump if the job takes 15 minutes to run. The dump alone can take a long time.
 
I once forgot to comment out a debug line in my UDF. When run with production data, not only was it slow, it blew up the cluster: it simply ran out of log space :)
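
A minimal sketch of that point, assuming the sorted listing does not need to appear on the console: the quoted script with the final DUMP replaced by a STORE. The output path '/logs-out/domain-counts' is a hypothetical name that does not come from the thread, and whether this change closes the gap with the Java job has not been measured here.

```pig
-- Same pipeline as the quoted script; only the last statement differs.
records = LOAD '/logs-in/*.log' USING PigStorage(',')
    AS (ts:long, server:int, user:int, action_id:int, domain:chararray, price:int);

grouped_by_domain = GROUP records BY domain;
freq = FOREACH grouped_by_domain GENERATE group AS domain, COUNT(records) AS mycount;

-- ORDER still adds extra MapReduce work for the sort; drop this statement
-- (and STORE freq instead) if the Java job being compared does not sort.
sorted = ORDER freq BY mycount DESC;

-- Assumption: '/logs-out/domain-counts' is a hypothetical HDFS output path.
-- STORE leaves the result in HDFS instead of printing every row to the console.
STORE sorted INTO '/logs-out/domain-counts' USING PigStorage(',');
```

Keeping the debugging DUMP and DESCRIBE statements commented out, as the quoted script already does, avoids launching extra jobs on the full data set.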

On Jun 17, 2011, at 5:06 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> A couple of possibilities that I'm kicking around off the top of my head...
>
> 1) Does your MR job also sort afterwards? That's going to kick off another
> MR job
> 2) Does your MR job compile all the results into one job?
>
> My guess is that the ORDER and DUMP steps are making it take longer.
>
> 2011/6/17 Sujee Maniyam <[EMAIL PROTECTED]>
>
>> I have log files like this:
>>  #timestamp (ms), server, user, action, domain, x, y, z
>>  1262332800008, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
>>  1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
>>  1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
>>
>> I have the following Pig script to count how often each domain appears in
>> the logs (for example, facebook.com was seen 10 times, etc.).
>>
>> Here is the pig script:
>>
>> --------------------------------
>> records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
>> server:int, user:int, action_id:int, domain:chararray, price:int);
>>
>> -- DUMP records;
>> grouped_by_domain = GROUP records BY domain;
>> -- DUMP grouped_by_domain;
>> -- DESCRIBE grouped_by_domain;
>>
>> freq = FOREACH grouped_by_domain GENERATE group as domain, COUNT(records) as mycount;
>> -- DESCRIBE freq;
>> -- DUMP freq;
>>
>> sorted = ORDER freq BY mycount DESC;
>> DUMP sorted;
>> --------------------------------
>>
>> This script takes an hour to run. I also wrote a simple Java MR job to
>> count the domains, and it takes about 15 minutes. So the Pig script is
>> taking 4x longer to complete.
>>
>> Any suggestions on what I am doing wrong in Pig?
>>
>> thanks
>> Sujee
>> http://sujee.net
>>