Pig >> mail # user >> how to control nested CROSS parallelism?


Re: how to control nested CROSS parallelism?
It's strange that this is being executed on the map side. GROUP is a
reduce-side operation (I'm assuming), so the nested FOREACH should run
on the reduce side, after grouping. Have you looked at the MR plan
to verify that it really is being executed map-side?
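
One way to look at the plan is Pig's EXPLAIN operator. A minimal sketch,
assuming a hypothetical input path 'items' and schema (neither is in your
message), and with your FILTER left out because its condition wasn't shown:

```pig
-- Hypothetical input path and field types; replace with the real LOAD.
itemProj1 = LOAD 'items' AS (id:long, other_id:long, sale_id:long);

itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
notFiltered = FOREACH itemGrp {
    itemProj2 = FOREACH itemProj1 GENERATE id, other_id;
    crossed = CROSS itemProj1, itemProj2;
    GENERATE FLATTEN(crossed);
};

-- Print the logical, physical, and MapReduce plans for this alias.
EXPLAIN notFiltered;
```

In the MapReduce plan part of the output, check whether the nested FOREACH
shows up under the map plan or the reduce plan of the job that does the GROUP.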

One thing to try might be to CROSS first, before grouping... although that
might take two reduce steps.
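
A rough sketch of the CROSS-first idea, again assuming a hypothetical LOAD
and made-up r_* field names. A top-level CROSS takes its own PARALLEL clause:

```pig
-- Hypothetical input path and field types; replace with the real LOAD.
itemProj1 = LOAD 'items' AS (id:long, other_id:long, sale_id:long);

-- Two projections of the same relation, so both sides of the CROSS
-- have distinct aliases and field names.
lhs = FOREACH itemProj1 GENERATE sale_id, id, other_id;
rhs = FOREACH itemProj1 GENERATE sale_id AS r_sale_id, id AS r_id,
                                 other_id AS r_other_id;

crossed = CROSS lhs, rhs PARALLEL 12;

-- Keep only pairs from the same sale, which is what the per-group
-- CROSS produced.
sameSale = FILTER crossed BY sale_id == r_sale_id;
```

Your own filter and projection would follow sameSale. Note that this pairs
every row with every other row before filtering, so it can be far more
expensive than the per-group version when there are many sales.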
On Mon, Jan 20, 2014 at 1:27 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi, I'm in trouble.
> Here is part of the code:
>
> itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
> notFiltered = FOREACH itemGrp{
>                 itemProj2 = FOREACH itemProj1
>                                    GENERATE FLATTEN(
>                                     TOTUPLE(id, other_id)) as
>                                            (id, other_id);
>
>                 crossed = CROSS itemProj1, itemProj2;
>                 filtered =  FILTER crossed by (
>                                                 --some cond
>                                                );
>                 projected = FOREACH filtered GENERATE f1, f2, f3;
>                 GENERATE FLATTEN(projected) as (f1, f2,f3);
> }
>
> The problem is that all of this is executed in the map phase. But I want it
> to be executed in the reduce phase to get the benefit of parallelism.
> Right now only two mappers (there is not much data before the CROSS
> explosion) perform the cross inside groups and the complicated filtering.
>
> I can't find a way to make it run in the reduce phase...
> What am I doing wrong?
>