Pig, mail # user - replicated join gets extra job


Re: replicated join gets extra job
Pradeep Gollakota 2013-11-12, 04:30
Use the ILLUSTRATE or EXPLAIN operators to look at the details of the
execution plan. At first glance it doesn't look like you'd need a second
job to do the joins, but if you can post the output of ILLUSTRATE/EXPLAIN,
we can look into it.
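
As a minimal sketch of what that could look like (the paths, schemas, and
aliases below are placeholders, not taken from your script), you would run
EXPLAIN on the last alias of the join chain. EXPLAIN prints the logical,
physical, and MapReduce plans; counting the MapReduce nodes in the last of
those should tell you how many jobs Pig intends to launch and where the
joins land.

```
-- Placeholder inputs and schemas; substitute the real ones from the script.
large_table    = LOAD 'large_input' AS (key1:chararray, f1:chararray);
config_table_1 = LOAD 'config_1'    AS (key2:chararray, v1:chararray);

-- Same shape as the joins in the original script: large relation first,
-- small relation last, so the small one is the one held in memory.
joined_1 = JOIN large_table BY key1 LEFT, config_table_1 BY key2 USING 'replicated';

-- Print the logical, physical, and MapReduce plans for joined_1.
EXPLAIN joined_1;

-- Run the pipeline on a small sample and show example rows at each step.
ILLUSTRATE joined_1;
```

I'm assuming here that ILLUSTRATE handles replicated joins on your Pig
version; if it doesn't, the EXPLAIN output alone is enough to see the plan.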
On Mon, Nov 11, 2013 at 4:36 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm running a job like this:
>
> raw_large = LOAD 'lots_of_files' AS (...);
> raw_filtered = FILTER raw_large BY ...;
> large_table = FOREACH raw_filtered GENERATE f1, f2, f3,....;
>
> joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2)
> USING 'replicated';
> joined_2 = JOIN joined_1    BY (key3) LEFT, config_table_2 BY (key4)
> USING 'replicated';
> joined_3 = JOIN joined_2    BY (key5) LEFT, config_table_3 BY (key6)
> USING 'replicated';
> joined_4 = JOIN joined_3    BY (key7) LEFT, config_table_4 BY (key8)
> USING 'replicated';
>
> basically left join a large table with 4 relatively small tables using the
> replicated join.
>
> I see a first load job has 120 mapper tasks and no reducer, and this job
> seems to be doing the load and filtering. And there is another job
> following that has 26 mapper tasks that seem to be doing the joins.
>
> Shouldn't there be only one job, with the joins done in the mapper
> phase of the first job?
>
> The 4 config tables (files) have these sizes respectively:
>
> 3MB
> 220kB
> 2kB
> 100kB
>
> These are running on AWS EMR with Pig 0.9.2 on xlarge instances, which
> have 15GB of memory.
>
> Thanks!
>