Pig user mailing list - how to control the number of mappers?


Re: how to control the number of mappers?
Yang 2012-01-17, 20:53
Prashant:

I tried splitting the input files, and yes, that worked: multiple mappers
were indeed created.

But then I would have to create a separate stage just to split the input
files, which is a bit cumbersome. It would be nice if there were some
control to directly limit the map input split size, etc.
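
For what it's worth, a minimal sketch of the kind of control I mean, assuming
Pig 0.8+ and Hadoop 1.x-era property names ("mapred.max.split.size" is my
assumption for capping the split size, not something confirmed in this thread):

  set pig.splitCombination false;     -- keep Pig from merging small splits
  set mapred.max.split.size 1048576;  -- assumed knob: cap each split at ~1 MB
  raw = LOAD 'input.txt';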

Thanks
Yang

On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> By block size I mean the actual HDFS block size. Based on your requirement
> it seems like the input files are extremely small and reducing the block
> size is not an option.
>
> Specifying "mapred.min.split.size" would not work for either Hadoop/Java MR
> or Pig here: Hadoop picks the maximum of (minSplitSize, blockSize), so
> raising the minimum split size can only make splits larger, never smaller.
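>
> (For reference, I believe FileInputFormat computes the split size as
> max(minSplitSize, min(maxSplitSize, blockSize)), which is why only lowering
> the maximum split size can yield more, smaller splits.)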
>
> Your job is more CPU-intensive than I/O-bound. I can think of splitting your
> input into multiple files (equal to the # of map tasks on your cluster)
> and turning off split combination (pig.splitCombination=false), as sketched
> below, though this is generally a terrible MR practice!
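>
> A minimal sketch of that workaround, assuming the input has already been
> split into one file per desired map task (the directory name is hypothetical):
>
>   set pig.splitCombination false;  -- one map per input file
>   raw = LOAD 'input_parts/';       -- directory holding the pre-split files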
>
> Another thing you could try is giving your map tasks more memory by
> raising the heap size in "mapred.child.java.opts".
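>
> For example (the 2 GB heap below is just an illustration, not a tested
> recommendation):
>
>   set mapred.child.java.opts '-Xmx2048m';  -- larger heap per map task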
>
> Thanks,
> Prashant
>
>
> On Wed, Jan 11, 2012 at 6:27 PM, Yang <[EMAIL PROTECTED]> wrote:
>
> > Prashant:
> >
> > thanks.
> >
> > By "reducing the block size", do you mean the split size? The block size
> > is fixed on HDFS.
> >
> > My application is not really data-heavy; each line of input takes a
> > long while to process. As a result, the input size is small, but the total
> > processing time is long, and the potential parallelism is high.
> >
> > Yang
> >
> > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > <[EMAIL PROTECTED]> wrote:
> > > Hi Yang,
> > >
> > > You cannot really control the number of mappers directly (it depends on
> > > the input splits), but you can certainly spawn more mappers in various
> > > ways, such as reducing the block size or setting pig.splitCombination
> > > to false (this *might* create more maps).
> > >
> > > The level of parallelization depends on how much data the 2 mappers are
> > > handling. You would not want a lot of maps each handling too little data.
> > > For example, if your input data set is only a few MB, it would not be a
> > > good idea to have more than 1 or 2 maps.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > Sent from my iPhone
> > >
> > > On Jan 11, 2012, at 6:13 PM, Yang <[EMAIL PROTECTED]> wrote:
> > >
> > >> I have a Pig script that does basically a map-only job:
> > >>
> > >> raw = LOAD 'input.txt';
> > >>
> > >> processed = FOREACH raw GENERATE convert_somehow($1, $2, ...);
> > >>
> > >> STORE processed INTO 'output.txt';
> > >>
> > >>
> > >>
> > >> I have many nodes in my cluster, so I want Pig to process the input with
> > >> more mappers, but it generates only 2 part-m-xxxxx files, i.e. it is
> > >> using 2 mappers.
> > >>
> > >> In a Hadoop job it's possible to pass a mapper count and
> > >> -Dmapred.min.split.size= ; would this also work for Pig? The PARALLEL
> > >> keyword only works for reducers.
> > >>
> > >>
> > >> Thanks
> > >> Yang
> >
>