replicated join gets extra job
Hi,

I'm running a job like this:

raw_large = LOAD 'lots_of_files' AS (...);
raw_filtered = FILTER raw_large BY ...;
large_table = FOREACH raw_filtered GENERATE f1, f2, f3,....;

joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated';
joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4) USING 'replicated';
joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6) USING 'replicated';
joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_4 BY (key8) USING 'replicated';

Basically, it left-joins a large table with 4 relatively small tables using
replicated joins.

I see a first job with 120 map tasks and no reducers, which seems to be
doing the load and filtering. It is followed by a second job with 26 map
tasks that seems to be doing the joins.

Shouldn't there be only one job, with the joins done in the map phase of
the first job?
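
One way to check how Pig splits the script into MapReduce jobs is the
EXPLAIN operator, which prints the logical, physical, and MapReduce plans
for an alias. A minimal sketch, assuming the final alias is joined_4 as in
the script above (no plan output is shown here):

EXPLAIN joined_4;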

The 4 config tables (files) have these sizes respectively:

3MB
220kB
2kB
100kB

This is running on AWS EMR with Pig 0.9.2 on xlarge instances, which have
15 GB of memory.

Thanks!