Pig >> mail # user >> UDF Performance Problem


James Newhaven 2012-09-03, 16:32
Dmitriy Ryaboy 2012-09-03, 17:21
Re: UDF Performance Problem
Thanks Dmitriy, all sorted now.

James

On Mon, Sep 3, 2012 at 6:21 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> That's because you used "group all", which groups everything into one
> group, which by definition can only go to one reducer.
>
> What if instead you group into some large-enough number of buckets?
>
> A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
> A_PRIME = FOREACH A GENERATE *, ROUND(RANDOM() * 1000) AS bucket;
> B = GROUP A_PRIME BY bucket PARALLEL $parallelism;
> SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A_PRIME));
> COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
>
> D
>
>
> On Mon, Sep 3, 2012 at 9:32 AM, James Newhaven <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> > I'd appreciate it if anyone has ideas or pointers regarding a Pig
> > script and custom UDF I have written. I've found it runs too slowly
> > on my Hadoop cluster to be useful.
> >
> > I have two million records inside a single 600MB file.
> >
> > For each record, I need to query a web service to retrieve additional
> > data for that record.
> >
> > The web service supports batch requests of up to 50 records.
> >
> > I split the two million records into bags of 50 items (using the datafu
> > BagSplit UDF) and then pass each bag to a custom UDF I have written
> > that processes each bag and queries the web service.
> >
> > I noticed that when my script reaches my UDF, only one reducer is used
> > and the job takes forever to complete (in fact, it has never finished;
> > I terminate it after a few hours).
> >
> > My script looks like this:
> >
> > A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
> > B = GROUP A ALL;
> > SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A));
> > COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
> >
> > Thanks,
> >
> > James
>
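
Putting Dmitriy's suggestion together, a complete script would look roughly like the sketch below. This is a minimal sketch rather than code from the thread: the REGISTER paths, the com.example.MyCustomUDF package, the output location, and the default parallelism of 100 are all assumptions, though datafu's BagSplit does live at datafu.pig.bags.BagSplit.

-- The jar paths and the MyCustomUDF package are assumptions;
-- substitute your own.
REGISTER datafu.jar;
REGISTER my-udfs.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
DEFINE MyCustomUDF com.example.MyCustomUDF();

-- Assumed default; tune to the size of the cluster.
%default parallelism 100

A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);

-- Tag each record with a random bucket in [0, 1000] so the GROUP BY
-- spreads the data across reducers instead of funneling it into one.
A_PRIME = FOREACH A GENERATE *, ROUND(RANDOM() * 1000) AS bucket;
B = GROUP A_PRIME BY bucket PARALLEL $parallelism;

-- Split each bucket's bag into sub-bags of at most 50 records,
-- matching the web service's batch-request limit.
SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A_PRIME));

-- Each reducer now makes one web-service call per 50-record bag.
COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));

STORE COMPLETE_RECORDS INTO 'complete_records';

The fix hinges on the synthetic bucket column: GROUP ... ALL produces a single group and therefore a single reducer, while grouping on roughly a thousand random buckets lets PARALLEL fan the slow web-service calls out across the cluster.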