Re: how to control nested CROSS parallelism?
ok, thanks!
2014/1/20 Pradeep Gollakota <[EMAIL PROTECTED]>

> It's strange that it's being executed on the Map-side. The group is a
> reduce side operation (I'm assuming) and it seems that the nested foreach
> would happen on Reduce-side after grouping. Have you looked at the MR plan
> to verify that it is being executed Map-side?
>
> One thing to try might be to CROSS first before grouping... although that
> might be 2 reduce steps.
>
>
> On Mon, Jan 20, 2014 at 1:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
>
> > Hi, I'm in trouble.
> > Here is part of the code:
> >
> > itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
> > notFiltered = FOREACH itemGrp {
> >     itemProj2 = FOREACH itemProj1
> >                 GENERATE FLATTEN(TOTUPLE(id, other_id)) AS (id, other_id);
> >
> >     crossed   = CROSS itemProj1, itemProj2;
> >     filtered  = FILTER crossed BY (
> >                     -- some cond
> >                 );
> >     projected = FOREACH filtered GENERATE f1, f2, f3;
> >     GENERATE FLATTEN(projected) AS (f1, f2, f3);
> > }
> >
> > The problem is that all of this is executed in the map phase, but I want
> > it to be executed in the reduce phase to get the benefit of parallelism.
> > Right now only two mappers (there is not much data before the CROSS
> > explosion) perform the cross inside the groups and the complicated
> > filtering.
> >
> > I can't find a way to make it run in the reduce phase...
> > What am I doing wrong?
> >
>