Pig >> mail # user >> Unique Self Cross Optimization


Unique Self Cross Optimization
I have an input data set of aliases, each with many identifying
attributes. There are on the order of ~1E8 aliases and ~1E5 distinct
attributes. I am attempting to generate a network of commutative
alias-alias pairings, where two aliases are paired if they share at least
one attribute. The vast majority of attributes each correspond to a
relatively small number of aliases (~1E3). The exceptions are a few
attributes (<1% of them) whose corresponding aliases number on the order
of the entire input alias set (~1E8).

I am running into an issue with the tasks that handle these few (<1%)
attributes with very many aliases. The reducers for some of these tasks
take many orders of magnitude longer to complete than the other 99% (many
hours versus minutes). A representation of the script is below (Pig 0.11.2):

SET default_parallel $REDUCERS;
SET pig.schematuple true;
SET pig.exec.mapPartAgg true;
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
X = LOAD '$INPUT/user_item' USING PigStorage() AS (alias:chararray, attributeURI:chararray);
A1 = FOREACH X GENERATE *;
A2 = FOREACH X GENERATE *;
A3 = JOIN A1 BY (attributeURI), A2 BY (attributeURI);
A4 = FILTER A3 BY (A1::alias != A2::alias);
-- projection, because X contains other fields not shown here
A5 = FOREACH A4 GENERATE A1::alias, A2::alias;
A6 = DISTINCT A5;
STORE A6 INTO '$OUTPUT/network' USING PigStorage();

Here, the reducer steps A4 and A5 take forever on a handful of reducer
tasks, likely because of the <1% of attributes described above. Is there a
better way to optimize this script?
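
To put rough numbers on the skew, here is a minimal Python sketch (not part
of the Pig script) that counts how many rows the self-join leaves after the
A4 filter for a single attribute. It assumes each (alias, attribute) row is
unique in X, and uses the approximate sizes quoted above; the function name
ordered_pairs is made up for illustration. Because a default reduce-side
join sends all rows for one join key to the same reducer, the heavy
attribute's pairs would all be produced by one task.

```python
def ordered_pairs(n_aliases: int) -> int:
    """Rows left after A4 for one attribute: all ordered pairs of distinct aliases."""
    return n_aliases * (n_aliases - 1)


# Approximate sizes taken from the description above (assumed, not measured).
typical = ordered_pairs(10**3)  # an attribute shared by ~1E3 aliases
heavy = ordered_pairs(10**8)    # an attribute shared by ~1E8 aliases

print(f"typical attribute: {typical:.3e} pairs")
print(f"heavy attribute:   {heavy:.3e} pairs")
print(f"ratio:             {heavy / typical:.1e}")
```

The quadratic growth is the point: an attribute with 1E5 times more aliases
yields roughly 1E10 times more joined rows.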

An example of the input X:
aa, cat
aa, dog
bb, dog
bb, bear
cc, cat
dd, bird

An example of the output A6:
aa, bb
aa, cc
bb, aa
cc, aa
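
For reference, a small Python sketch of the logic the script is meant to
implement (join on attribute, drop self-pairs, de-duplicate), run on the
sample input X. It is only an illustration of the semantics, not a
replacement for the Pig job; the names rows and alias_network are
hypothetical. Under this logic an alias such as dd, whose only attribute
(bird) is shared with no other alias, does not appear in any pair.

```python
from collections import defaultdict
from itertools import permutations

# Sample input X from above: (alias, attribute) rows.
rows = [
    ("aa", "cat"), ("aa", "dog"), ("bb", "dog"),
    ("bb", "bear"), ("cc", "cat"), ("dd", "bird"),
]


def alias_network(rows):
    """Return sorted, de-duplicated ordered alias pairs sharing at least one attribute."""
    aliases_by_attr = defaultdict(set)
    for alias, attr in rows:
        aliases_by_attr[attr].add(alias)

    pairs = set()  # the set plays the role of DISTINCT
    for aliases in aliases_by_attr.values():
        # permutations(..., 2) yields both (a, b) and (b, a), never (a, a),
        # which mirrors the self-join followed by the alias != alias filter.
        pairs.update(permutations(sorted(aliases), 2))
    return sorted(pairs)


for left, right in alias_network(rows):
    print(f"{left}, {right}")
```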

Many Thanks.  -Dan