
Re: how to control the number of mappers?


By "reducing the block size", do you mean the split size? The block size
is fixed on a Hadoop HDFS cluster.

My application is not really data-heavy; each line of input takes a
long while to process. As a result, the input size is small, but the total
processing time is long and the potential parallelism is high.


On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi wrote:
> Hi Yang,
> You cannot really control the number of mappers directly (it depends on
> the input splits), but you can certainly spawn more mappers in various
> ways, such as reducing the block size or setting pig.splitCombination to
> false (this *might* create more maps).
> The level of parallelization depends on how much data the 2 mappers are
> handling. You would not want a lot of maps handling too little data.
> For example, if your input data set is only a few MB, it would not be a
> good idea to have more than 1 or 2 maps.
> Thanks,
> Prashant
> On Jan 11, 2012, at 6:13 PM, Yang <[EMAIL PROTECTED]> wrote:
>> I have a Pig script that is basically a map-only job:
>> raw = LOAD 'input.txt' ;
>> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>> store processed into 'output.txt';
>> I have many nodes on my cluster, so I want Pig to process the input in
>> more mappers, but it generates only 2 part-m-xxxxx files, i.e. it is
>> using 2 mappers.
>> In a Hadoop job it's possible to pass the mapper count and
>> -Dmapred.min.split.size= . Would this also work for Pig? The PARALLEL
>> keyword only works for reducers.
>> Thanks
>> Yang
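
For a small input where each line is expensive to process, one option the thread points toward is to lower the maximum split size and turn off split combination from inside the script itself. The lines below are a minimal sketch, not a tested recipe. The 1 MB cap is an arbitrary illustrative value. The property name mapred.max.split.size assumes an older Hadoop release (newer releases call it mapreduce.input.fileinputformat.split.maxsize). It also assumes the input is a splittable file, such as uncompressed text.

```pig
-- Sketch only: the values below are illustrative assumptions, not tuned settings.

-- Stop Pig from combining small input splits into one larger split per mapper.
SET pig.splitCombination 'false';

-- Cap each input split at about 1 MB so that a small file is cut into more
-- splits, and therefore more map tasks. The property name depends on the
-- Hadoop version in use.
SET mapred.max.split.size '1048576';

-- The LOAD / FOREACH / STORE statements from the original script follow unchanged.
```

Note that mapred.min.split.size works in the opposite direction: raising it makes splits larger and tends to reduce the number of mappers, so the maximum split size is the one to lower here.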