Pig, mail # user - how to control nested CROSS parallelism?


Re: how to control nested CROSS parallelism?
Pradeep Gollakota 2014-01-20, 18:27
It's strange that it's being executed map-side. GROUP is a reduce-side
operation (I'm assuming), so the nested FOREACH should run reduce-side,
after the grouping. Have you looked at the MR plan to verify that it is
actually being executed map-side?
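
One way to check this is Pig's EXPLAIN operator. A minimal sketch, assuming the alias from the quoted script below is the one of interest:

```pig
-- Prints the logical, physical, and MapReduce plans for the alias.
-- In the MapReduce plan, each job lists its operators under
-- "Map Plan" and "Reduce Plan".
EXPLAIN notFiltered;
```

If the nested CROSS and FILTER appear under "Reduce Plan" for the job that performs the GROUP, they are running reduce-side.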

One thing to try might be to CROSS first, before grouping... although that
might take 2 reduce steps.
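
A related option, sketched here rather than tested: a self-join on sale_id instead of a literal CROSS. Joining the relation with itself on the grouping key yields every pair of tuples that share a sale_id, which is the same set of pairs the per-group CROSS produces. The sketch assumes itemProj1 carries at least sale_id, id and other_id; the alias name paired is hypothetical.

```pig
-- Assumption: itemProj1 has at least the fields sale_id, id, other_id.
-- A second alias keeps field names unambiguous in the self-join.
itemProj2 = FOREACH itemProj1 GENERATE sale_id, id, other_id;

-- Default (reduce-side) join: pairs every tuple with every tuple
-- that has the same sale_id, spread over 12 reducers.
paired = JOIN itemProj1 BY sale_id, itemProj2 BY sale_id PARALLEL 12;

-- Apply the original filter condition to paired here, referring to
-- fields as itemProj1::id, itemProj2::other_id, and so on.
```

Because the pairing happens in the join's reduce phase, PARALLEL on the JOIN controls how many reducers share the work.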
On Mon, Jan 20, 2014 at 1:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi, I'm in trouble.
> Here is part of the code:
>
> itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
> notFiltered = FOREACH itemGrp {
>     itemProj2 = FOREACH itemProj1
>                 GENERATE FLATTEN(TOTUPLE(id, other_id)) AS (id, other_id);
>
>     crossed = CROSS itemProj1, itemProj2;
>     filtered = FILTER crossed BY (
>                    --some cond
>                );
>     projected = FOREACH filtered GENERATE f1, f2, f3;
>     GENERATE FLATTEN(projected) AS (f1, f2, f3);
> }
>
> The problem is that all of this is executed in the map phase, but I want
> it to be executed in the reduce phase to get the benefit of parallelism.
> Right now only two mappers (there is not too much data before the CROSS
> explosion) perform the cross inside groups and the complicated filtering.
>
> I can't find a way to make it run in the reduce phase...
> What am I doing wrong?
>