Pig, mail # user - Is there a way to set reducer number of pig besides using parallel keyword?


Hui Qi 2011-10-12, 18:35
Dmitriy Ryaboy 2011-10-12, 21:02
Re: Is there a way to set reducer number of pig besides using parallel keyword?
Norbert Burger 2011-10-12, 21:08
For a more detailed explanation, take a look also at
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features.

In summary:

* The PARALLEL keyword at the operator level overrides any other setting
* SET default_parallel determines reducer count for all blocking operators
(ones that force a reduce phase)
* If neither of these is set, then reducer count is determined via a
heuristic based on total input size
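
The three rules above can be sketched as a small model. This is only an
illustration in Python, not Pig's actual code: the function names are
hypothetical, the two defaults are the documented values of
pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max as I understand
them (check your Pig version), and rounding the estimate up is an assumption.

```python
import math

# Assumed defaults for Pig's input-size heuristic; verify against your Pig version.
BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer (about 1 GB)
MAX_REDUCERS = 999                 # pig.exec.reducers.max


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Input-size heuristic: roughly one reducer per bytes_per_reducer of input.

    Rounding up and the lower bound of 1 are assumptions of this sketch.
    """
    estimate = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimate))


def resolve_reducers(total_input_bytes, parallel=None, default_parallel=None):
    """Hypothetical helper: pick the reducer count in the order listed above."""
    if parallel is not None:          # PARALLEL on the operator wins
        return parallel
    if default_parallel is not None:  # then SET default_parallel
        return default_parallel
    return estimate_reducers(total_input_bytes)  # otherwise the heuristic


if __name__ == "__main__":
    size = 1_500_000_000  # hypothetical 1.5 GB of input
    print(resolve_reducers(size, parallel=40, default_parallel=8))
    print(resolve_reducers(size, default_parallel=8))
    print(resolve_reducers(size))
```

Under these assumptions, an input somewhere between 1 GB and 2 GB would get 2
reducers once PARALLEL is removed and default_parallel is unset.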

Norbert

On Wed, Oct 12, 2011 at 5:02 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> set default_parallel 8
>
> -D
>
> On Wed, Oct 12, 2011 at 11:35 AM, Hui Qi <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > I tried to set the number of reducers in the following way:
> > java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf
> > org.apache.pig.Main ./L1.pig
> >
> > but it doesn't work: the number of reducers stays at 40, which is
> > the parallel value in L1.pig (L1.pig is from PigMix).
> > If I delete the "parallel 40" from the script, the number of reduce
> > tasks becomes 2, where I expected 1.
> >
> > L1.pig:
> > -- This script tests reading from a map, flattening a bag of maps, and
> > -- use of bincond.
> > register pigperf.jar;
> > A = load '/user/pig/tests/data/pigmix/page_views' using
> > org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> >    as (user, action, timespent, query_term, ip_addr, timestamp,
> >        estimated_revenue, page_info, page_links);
> > B = foreach A generate user, (int)action as action, (map[])page_info as
> > page_info,
> >    flatten((bag{tuple(map[])})page_links) as page_links;
> > C = foreach B generate user,
> >    (action == 1 ? page_info#'a' : page_links#'b') as header;
> > D = group C by user parallel 40;
> > E = foreach D generate group, COUNT(C) as cnt;
> > store E into 'L1out';
> >
> > Best,
> > Hui
> >
>
Andrew Clegg 2011-10-12, 22:47
Dmitriy Ryaboy 2011-10-13, 00:41