Pig, mail # user - How to control a number of reducers in Apache Pig


Re: How to control a number of reducers in Apache Pig
Jonathan Coveney 2013-03-15, 10:15
The script you posted wouldn't have any reducers, so it wouldn't matter.
It's a map-only job.
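
(A sketch of one common workaround, not from the thread: a plain FOREACH
never triggers a reduce phase, so inserting a blocking operator such as
GROUP gives PARALLEL/default_parallel something to act on. Relation and
UDF names below reuse the quoted script; the restructuring itself is an
assumption.)

------------------------------------------------
SET default_parallel 16;
REGISTER myjar.jar;
input_pairs = LOAD '$input' USING
pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
'org.apache.hadoop.io.BytesWritable') as (key:chararray,
value:bytearray);
-- GROUP is a blocking operator: it forces a shuffle and a reduce
-- phase, so PARALLEL (or default_parallel) now takes effect
grouped = GROUP input_pairs BY key PARALLEL 16;
-- everything between the GROUP and the STORE runs on the reduce
-- side; FLATTEN unwraps each bag back into (key, value) records
ungrouped = FOREACH grouped GENERATE FLATTEN(input_pairs);
input_protos = FOREACH ungrouped GENERATE
FLATTEN(pl.example.ReadProtobuf(value));
output_protos = FOREACH input_protos GENERATE
FLATTEN(pl.example.XMLGenerator(*));
STORE output_protos INTO '$output' USING PigStorage();
------------------------------------------------

Note that this shuffles the full 400GB just to fan the work out; if the
real goal is simply more parallelism for the UDFs, increasing the number
of input splits (i.e. more map tasks) avoids the shuffle entirely.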
2013/3/15 <[EMAIL PROTECTED]>

> Dear Apache Pig Users,
>
> It is easy to control the number of reducers in JOIN, GROUP, COGROUP,
> etc. statements, either globally with a "set default_parallel $NUM"
> command or with a "parallel $NUM" clause at the end of the statement.
>
> However, I am interested in controlling the number of reducers in a
> foreach statement.
> The case is as follows:
> * on CDH 4.0.1. with Pig 0.9.2.
> * read one sequence file (of many equivalent files) of about 400GB,
> * process each element in a UDF __using as many reducers as possible__
> * store the results
>
> Apache Pig script implementing this case -- which gives __only one__
> reducer -- is below:
> ------------------------------------------------
> SET default_parallel 16;
> REGISTER myjar.jar;
> input_pairs = LOAD '$input' USING
> pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
> 'org.apache.hadoop.io.BytesWritable') as (key:chararray,
> value:bytearray);
> input_protos  = FOREACH input_pairs GENERATE
> FLATTEN(pl.example.ReadProtobuf(value));
> output_protos = FOREACH input_protos GENERATE
> FLATTEN(pl.example.XMLGenerator(*));
> STORE output_protos INTO '$output' USING PigStorage();
> ------------------------------------------------
>
> As far as I know, "set mapred.reduce.tasks 5" can only limit the
> maximum number of reducers.
>
> Could you give me some advice? Am I missing something?
>
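
For reference, a hedged sketch (not part of the original thread;
relation and field names are illustrative) of the two parallelism
controls the question refers to:

------------------------------------------------
-- script-wide default for every reduce phase
SET default_parallel 16;
-- per-operator override; PARALLEL only affects operators that force
-- a shuffle (GROUP, COGROUP, JOIN, ORDER, DISTINCT, CROSS) -- a
-- plain FOREACH is map-only and has no reducers to configure
grouped = GROUP records BY key PARALLEL 32;
joined  = JOIN a BY id, b BY id PARALLEL 8;
------------------------------------------------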