Pig, mail # user - UDF Performance Problem


UDF Performance Problem
James Newhaven 2012-09-03, 16:32
Hi,

I'd appreciate any ideas or pointers regarding a Pig script and custom UDF
I have written. It runs too slowly on my Hadoop cluster to be useful.

I have two million records inside a single 600MB file.

For each record, I need to query a web service to retrieve additional data
for this record.

The web service supports batch requests of up to 50 records.

I split the two million records into bags of 50 items (using the datafu
BagSplit UDF) and then pass each bag on to a custom UDF I have written that
processes each bag and queries the web service.
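
For illustration only, the batching step described above can be sketched in plain Java, independent of Pig and DataFu. The class name BatchSketch and the method partition are hypothetical and are not part of the original script; the web service call itself is omitted because its API is not given here.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    // Hypothetical helper: splits records into consecutive batches
    // of at most batchSize items, preserving the input order.
    public static <T> List<List<T>> partition(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            // Copy the sublist so each batch is independent of the input list.
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> recordIds = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            recordIds.add(i);
        }
        // Each batch would be sent to the web service in a single request
        // (the request itself is not shown).
        for (List<Integer> batch : partition(recordIds, 50)) {
            System.out.println("batch of " + batch.size() + " records");
        }
    }
}
```

At 50 records per request, two million records correspond to 40,000 web service requests in total.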

I noticed that when my script reaches my UDF, only one reducer is used and
the job takes forever to complete (in fact, it has never finished, because
I terminate it after a few hours).

My script looks like this:

A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
B = GROUP A ALL;
SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A));
COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));

Thanks,

James
Dmitriy Ryaboy 2012-09-03, 17:21
James Newhaven 2012-09-03, 20:31