Apache Pig user mailing list: How to control the number of reducers in Apache Pig


tsunami7@... 2013-03-15, 09:58
Re: How to control the number of reducers in Apache Pig
The script you posted wouldn't have any reducers, so it wouldn't matter.
It's a map-only job.
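
To illustrate that point, here is a minimal sketch (not from the original script): a blocking operator such as GROUP introduces a reduce phase, and only then does default_parallel or an explicit PARALLEL clause take effect. The plain PigStorage load and grouping on the key field are assumptions made for the example.

------------------------------------------------
-- Sketch: GROUP forces a shuffle, so the FOREACH below runs in reducers.
SET default_parallel 16;
input_pairs = LOAD '$input' USING PigStorage() AS (key:chararray, value:bytearray);
-- PARALLEL on the blocking operator sets the reducer count for this step.
grouped = GROUP input_pairs BY key PARALLEL 16;
-- This FOREACH now executes in the 16 reducers.
processed = FOREACH grouped GENERATE FLATTEN(input_pairs);
STORE processed INTO '$output' USING PigStorage();
------------------------------------------------

Note that the extra GROUP means a full shuffle of the ~400GB input, so it is only worthwhile if the per-record work really has to run in reducers.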
2013/3/15 <[EMAIL PROTECTED]>

> Dear Apache Pig Users,
>
> It is easy to control the number of reducers in JOIN, GROUP, COGROUP,
> etc. statements with a script-wide "set default_parallel $NUM" command
> or a "PARALLEL $NUM" clause at the end of the statement.
>
> However, I am interested in controlling the number of reducers in a
> FOREACH statement.
> The case is as follows:
> * on CDH 4.0.1 with Pig 0.9.2,
> * read one sequence file (of many equivalent files) of about 400GB,
> * process each element in a UDF __using as many reducers as possible__,
> * store the results.
>
> The Apache Pig script implementing this case -- which runs with __only
> one__ reducer -- is below:
> ------------------------------------------------
> SET default_parallel 16;
> REGISTER myjar.jar;
> input_pairs = LOAD '$input' USING
> pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
> 'org.apache.hadoop.io.BytesWritable') as (key:chararray,
> value:bytearray);
> input_protos  = FOREACH input_pairs GENERATE
> FLATTEN(pl.example.ReadProtobuf(value));
> output_protos = FOREACH input_protos GENERATE
> FLATTEN(pl.example.XMLGenerator(*));
> STORE output_protos INTO '$output' USING PigStorage();
> ------------------------------------------------
>
> As far as I know, "set mapred.reduce.tasks 5" can only limit the maximum
> number of reducers.
>
> Could you give me some advice? Am I missing something?
>
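
For the quoted script, which has no reduce phase at all, parallelism comes from the number of map tasks, i.e. the number of input splits. The sketch below shows the standard Pig split-related settings (pig.splitCombination and pig.maxCombinedSplitSize); whether they change anything depends on how pl.example.MySequenceFileLoader computes its splits, which is an assumption here.

------------------------------------------------
-- Sketch: in a map-only pipeline, more input splits means more parallel tasks.
-- Stop Pig from combining many splits into a few map tasks:
SET pig.splitCombination false;
-- Or cap the bytes handled per map task so that more maps are created:
-- SET pig.maxCombinedSplitSize 134217728;
REGISTER myjar.jar;
input_pairs   = LOAD '$input' USING
    pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
                                    'org.apache.hadoop.io.BytesWritable')
    AS (key:chararray, value:bytearray);
input_protos  = FOREACH input_pairs GENERATE FLATTEN(pl.example.ReadProtobuf(value));
output_protos = FOREACH input_protos GENERATE FLATTEN(pl.example.XMLGenerator(*));
STORE output_protos INTO '$output' USING PigStorage();
------------------------------------------------

For an ordinary splittable SequenceFile, lowering mapred.max.split.size on the Hadoop side is another way to obtain more map tasks, again assuming the custom loader follows the standard FileInputFormat split mechanics.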