Re: how to control the number of mappers?
Prashant:

Thanks.

By "reducing the block size", do you mean the split size? The block
size is fixed on HDFS.

My application is not really data-heavy; each line of input takes a
long while to process. As a result, the input size is small, but the
total processing time is long, and the potential parallelism is high.
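
A minimal sketch (untested) of how that could be tuned at the top of
the script, following Prashant's suggestion quoted below; the 1 MB cap
is an arbitrary figure for illustration:

  set pig.splitCombination false;    -- keep Pig from combining small splits into one map
  set mapred.max.split.size 1048576; -- ask for splits of at most ~1 MB each
  -- ... rest of the script (LOAD / FOREACH / STORE) unchanged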

Yang

On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
<[EMAIL PROTECTED]> wrote:
> Hi Yang,
>
> You cannot really control the number of mappers directly (it depends
> on the input splits), but you can certainly spawn more mappers in
> various ways, such as reducing the block size or setting
> pig.splitCombination to false (this *might* create more maps).
>
> The level of parallelism depends on how much data the 2 mappers are
> handling. You would not want a lot of maps each handling too little
> data. For example, if your input data set is only a few MB, it would
> not be a good idea to have more than 1 or 2 maps.
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Jan 11, 2012, at 6:13 PM, Yang <[EMAIL PROTECTED]> wrote:
>
>> I have a Pig script that does basically a map-only job:
>>
>> raw = LOAD 'input.txt';
>>
>> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>>
>> STORE processed INTO 'output.txt';
>>
>>
>>
>> I have many nodes in my cluster, so I want Pig to process the input
>> with more mappers, but it generates only 2 part-m-xxxxx files, i.e.
>> it uses only 2 mappers.
>>
>> In a plain Hadoop job it's possible to pass a mapper count and
>> -Dmapred.min.split.size= ; would this also work for Pig? The
>> PARALLEL keyword only works for reducers (a sketch follows after
>> this quoted message).
>>
>>
>> Thanks
>> Yang
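
To make the two knobs from the quoted question concrete: PARALLEL only
sets the number of reducers for operators that have a reduce phase, so
it cannot help a map-only script. A minimal sketch, where the GROUP
statement and the figure 10 exist purely to illustrate the syntax:

  grouped = GROUP processed BY $0 PARALLEL 10; -- 10 reducers for this GROUP; maps are unaffected

Map-side split properties can instead be passed to Pig as -D
definitions on the command line, placed before the script name
(property names assume the old mapred API current at the time of this
thread; script.pig is a placeholder):

  pig -Dpig.splitCombination=false -Dmapred.max.split.size=1048576 script.pig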