Pig >> mail # user >> Is there a way to set reducer number of pig besides using parallel keyword?


Hui Qi 2011-10-12, 18:35
Dmitriy Ryaboy 2011-10-12, 21:02
Re: Is there a way to set reducer number of pig besides using parallel keyword?
For a more detailed explanation, also take a look at
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features.

In summary:

* The PARALLEL keyword at the operator level overrides any other setting.
* SET default_parallel determines the reducer count for all blocking operators
(ones that force a reduce phase).
* If neither of these is set, the reducer count is determined by a
heuristic based on total input size.
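The three mechanisms above can be sketched in a short Pig fragment (the relation names are illustrative; the property names and defaults below are from the Pig 0.8 docs and may differ in other versions):

```pig
-- 1. Per-operator: PARALLEL overrides every other setting,
--    but only for this one operator's reduce phase.
D = GROUP C BY user PARALLEL 40;

-- 2. Script-wide: default_parallel applies to all blocking
--    operators (GROUP, JOIN, ORDER, DISTINCT, ...) that lack
--    an explicit PARALLEL clause.
SET default_parallel 20;

-- 3. Neither set: Pig estimates reducers from total input size,
--    roughly one reducer per pig.exec.reducers.bytes.per.reducer
--    bytes (default 1 GB), capped at pig.exec.reducers.max
--    (default 999).
```

That last heuristic is likely why dropping "parallel 40" from L1.pig produced 2 reducers rather than 1: the PigMix input presumably spans a bit more than one reducer's worth of bytes.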

Norbert

On Wed, Oct 12, 2011 at 5:02 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> set default_parallel 8
>
> -D
>
> On Wed, Oct 12, 2011 at 11:35 AM, Hui Qi <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > I tried to set the reducer count in the following way:
> > java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf
> > org.apache.pig.Main ./L1.pig
> >
> > but it doesn't work: the reducer count remains 40, which is
> > the PARALLEL value in L1.pig (L1.pig is from PigMix).
> > If I delete "parallel 40" from the script, the number of reduce tasks
> > becomes 2, though I expected it to be 1.
> >
> > L1.pig:
> > -- This script tests reading from a map, flattening a bag of maps, and
> > -- use of bincond.
> > register pigperf.jar;
> > A = load '/user/pig/tests/data/pigmix/page_views' using
> > org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> >    as (user, action, timespent, query_term, ip_addr, timestamp,
> >        estimated_revenue, page_info, page_links);
> > B = foreach A generate user, (int)action as action, (map[])page_info as
> > page_info,
> >    flatten((bag{tuple(map[])})page_links) as page_links;
> > C = foreach B generate user,
> >    (action == 1 ? page_info#'a' : page_links#'b') as header;
> > D = group C by user parallel 40;
> > E = foreach D generate group, COUNT(C) as cnt;
> > store E into 'L1out';
> >
> > Best,
> > Hui
> >
>
Andrew Clegg 2011-10-12, 22:47
Dmitriy Ryaboy 2011-10-13, 00:41