Unique Self Cross Optimization
My input data consists of aliases, each with many identifying attributes.
There are on the order of ~1E8 aliases and ~1E5 distinct attributes.  I am
attempting to generate a network of commutative alias-alias pairings that
share at least one attribute.  The vast majority of attributes each
correspond to a relatively small number of aliases (~1E3).  A few attributes
(<1% of them), however, have corresponding aliases on the order of the
entire input alias set (~1E8).

I am running into an issue with the tasks that handle these <1% of
attributes with very large alias sets.  The reducers for some of these tasks
take orders of magnitude longer to complete than the other 99% (many hours
versus minutes).  A representation of the script is below (Pig 0.11.2):

SET default_parallel $REDUCERS;
SET pig.schematuple true;
SET pig.exec.mapPartAgg true;
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
X = LOAD '$INPUT/user_item' USING PigStorage() AS (alias:chararray, attributeURI:chararray);
A1 = FOREACH X GENERATE *;
A2 = FOREACH X GENERATE *;
A3 = JOIN A1 BY (attributeURI), A2 BY (attributeURI);
A4 = FILTER A3 BY (A1::alias != A2::alias);
A5 = FOREACH A4 GENERATE A1::alias, A2::alias; -- projection because X contains other fields not shown here
A6 = DISTINCT A5;
STORE A6 INTO '$OUTPUT/network' USING PigStorage();

Here, the reducer steps A4 and A5 take far longer on a handful of reducer
tasks than on the rest, most likely because of the <1% of attributes
described above.  Is there a better way to write or optimize this script?
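
For a rough sense of the skew, here is a back-of-the-envelope sketch in Python (it is not part of the Pig script). It assumes that the self-join on a single attribute shared by n aliases emits n*n rows and that the != filter then drops the n self-pairs. The function name ordered_pairs is only for illustration.

```python
def ordered_pairs(n):
    """Ordered alias pairs produced by one attribute shared by n aliases.

    Assumption: the self-join emits n * n rows for that join key, and the
    A1::alias != A2::alias filter removes the n rows pairing an alias with itself.
    """
    return n * (n - 1)


typical = ordered_pairs(10**3)  # attribute with ~1E3 aliases
heavy = ordered_pairs(10**8)    # attribute with ~1E8 aliases

print("typical attribute:", typical)
print("heavy attribute:  ", heavy)
print("ratio:            ", heavy // typical)
```

Under that assumption, a single heavy attribute generates roughly ten orders of magnitude more rows than a typical one, and all of them land on the one reducer that owns that join key.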

An example of the input X:
aa, cat
aa, dog
bb, dog
bb, bear
cc, cat
dd, bird

An example of the output A6:
aa, bb
aa, cc
bb, aa
cc, aa
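
To check small samples by hand, the join / filter / distinct logic can be emulated in a few lines of Python. This is only a sketch of what the script is meant to compute, not how Pig executes it, and alias_network is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import permutations


def alias_network(rows):
    """Return sorted, distinct ordered alias pairs sharing at least one attribute.

    rows is an iterable of (alias, attribute) tuples, mirroring the two
    fields loaded into X.
    """
    # Group aliases by attribute (the join key).
    aliases_by_attribute = defaultdict(set)
    for alias, attribute in rows:
        aliases_by_attribute[attribute].add(alias)

    # Every ordered pair of different aliases within a group is an edge;
    # collecting them in a set plays the role of DISTINCT.
    pairs = set()
    for aliases in aliases_by_attribute.values():
        pairs.update(permutations(aliases, 2))
    return sorted(pairs)


sample = [
    ("aa", "cat"), ("aa", "dog"),
    ("bb", "dog"), ("bb", "bear"),
    ("cc", "cat"), ("dd", "bird"),
]

for left, right in alias_network(sample):
    print(left, right)
```

An alias appears in the result only if it shares an attribute with at least one other alias.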

Many Thanks.  -Dan