Pig >> mail # user >> UDF Performance Problem


UDF Performance Problem
Hi,

I'd appreciate any ideas or pointers regarding a Pig script and custom UDF
I have written. It runs too slowly on my Hadoop cluster to be useful.

I have two million records inside a single 600MB file.

For each record, I need to query a web service to retrieve additional data
for this record.

The web service supports batch requests of up to 50 records.

I split the two million records into bags of 50 items (using the datafu
BagSplit UDF) and then pass each bag on to a custom UDF I have written that
processes each bag and queries the web service.
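
As a rough illustration of the batching described above, here is a minimal sketch in plain Java (no Pig or DataFu dependencies). It is not the actual UDF: `lookupBatch` is a hypothetical stand-in for the web service call, and the class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    // The web service accepts at most 50 records per request.
    static final int BATCH_SIZE = 50;

    // Split the record ids into consecutive batches of at most batchSize items.
    static List<List<Integer>> split(List<Integer> ids, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    // Hypothetical stand-in for the web service: returns one enriched
    // line per record id. A real client would make one HTTP request here.
    static List<String> lookupBatch(List<Integer> batch) {
        List<String> enriched = new ArrayList<>();
        for (int id : batch) {
            enriched.add(id + "\tdata-for-" + id);
        }
        return enriched;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            ids.add(i);
        }
        List<List<Integer>> batches = split(ids, BATCH_SIZE);
        int enrichedCount = 0;
        for (List<Integer> batch : batches) {
            enrichedCount += lookupBatch(batch).size();
        }
        System.out.println(batches.size() + " batches, " + enrichedCount + " records");
    }
}
```

At 50 records per request, two million records works out to 40,000 web service calls in total.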

I noticed that when my script reaches my UDF, only one reducer is used and
the job takes forever to complete (in fact it has never finished, because I
terminate it after a few hours).

My script looks like this:

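-- Load one record id per line.
A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
-- Collect every record into a single bag.
B = GROUP A ALL;
-- Break that bag into bags of 50 records each.
SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A));
-- Query the web service once per bag of 50.
COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));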

Thanks,

James
Dmitriy Ryaboy 2012-09-03, 17:21
James Newhaven 2012-09-03, 20:31