|
|
-
UDF Performance Problem
James Newhaven 2012-09-03, 16:32
Hi,
I'd appreciate if anyone has some ideas/pointers regarding a pig script and custom UDF I have written. I've found it runs too slowly on my hadoop cluster to be useful.......
I have two million records inside a single 600MB file.
For each record, I need to query a web service to retrieve additional data for this record.
The web service supports batch requests of up to 50 records.
I split the two million records into bags of 50 items (using the datafu BagSplit UDF) and then pass each bag on to a custom UDF I have written that processes each bag and queries the web service.
I noticed when my script reaches my UDF, only one reducer is used and the job takes forever to complete (in fact it has never finished since I terminate it after a few hours).
My script looks like this:
A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); B = GROUP B ALL; SPLITS = FOREACH B GENERATE Flatten(BagSplit(50,A)); COMPLETE_RCORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
Thanks,
James
+
James Newhaven 2012-09-03, 16:32
-
Re: UDF Performance Problem
Dmitriy Ryaboy 2012-09-03, 17:21
That's cause you used "group all" which groups everything into one group, which by definition can only go to one reducer.
What if instead you group into some large-enough number of buckets?
A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); A_PRIME = FOREACH A generate *, ROUND(RANDOM() * 1000) as bucket; B = GROUP A_PRIME by bucket PARALLEL $parallelism; SPLITS = FOREACH B GENERATE Flatten(BagSplit(50,A_PRIME)); COMPLETE_RCORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
D On Mon, Sep 3, 2012 at 9:32 AM, James Newhaven <[EMAIL PROTECTED]> wrote: > Hi, > > I'd appreciate if anyone has some ideas/pointers regarding a pig script and > custom UDF I have written. I've found it runs too slowly on my hadoop > cluster to be useful....... > > I have two million records inside a single 600MB file. > > For each record, I need to query a web service to retrieve additional data > for this record. > > The web service supports batch requests of up to 50 records. > > I split the two million records into bags of 50 items (using the datafu > BagSplit UDF) and then pass each bag on to a custom UDF I have written that > processes each bag and queries the web service. > > I noticed when my script reaches my UDF, only one reducer is used and the > job takes forever to complete (in fact it has never finished since I > terminate it after a few hours). > > My script looks like this: > > A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); > B = GROUP B ALL; > SPLITS = FOREACH B GENERATE Flatten(BagSplit(50,A)); > COMPLETE_RCORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0)); > > Thanks, > > James
+
Dmitriy Ryaboy 2012-09-03, 17:21
-
Re: UDF Performance Problem
James Newhaven 2012-09-03, 20:31
Thanks Dmitriy, all sorted now.
James
On Mon, Sep 3, 2012 at 6:21 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> That's cause you used "group all" which groups everything into one > group, which by definition can only go to one reducer. > > What if instead you group into some large-enough number of buckets? > > A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); > A_PRIME = FOREACH A generate *, ROUND(RANDOM() * 1000) as bucket; > B = GROUP A_PRIME by bucket PARALLEL $parallelism; > SPLITS = FOREACH B GENERATE Flatten(BagSplit(50,A_PRIME)); > COMPLETE_RCORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0)); > > D > > > On Mon, Sep 3, 2012 at 9:32 AM, James Newhaven <[EMAIL PROTECTED]> > wrote: > > Hi, > > > > I'd appreciate if anyone has some ideas/pointers regarding a pig script > and > > custom UDF I have written. I've found it runs too slowly on my hadoop > > cluster to be useful....... > > > > I have two million records inside a single 600MB file. > > > > For each record, I need to query a web service to retrieve additional > data > > for this record. > > > > The web service supports batch requests of up to 50 records. > > > > I split the two million records into bags of 50 items (using the datafu > > BagSplit UDF) and then pass each bag on to a custom UDF I have written > that > > processes each bag and queries the web service. > > > > I noticed when my script reaches my UDF, only one reducer is used and the > > job takes forever to complete (in fact it has never finished since I > > terminate it after a few hours). > > > > My script looks like this: > > > > A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int); > > B = GROUP B ALL; > > SPLITS = FOREACH B GENERATE Flatten(BagSplit(50,A)); > > COMPLETE_RCORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0)); > > > > Thanks, > > > > James >
+
James Newhaven 2012-09-03, 20:31
|
|