Re: how to control the number of mappers?
Prashant:

I tried splitting the input files; yes, that worked, and multiple mappers
were indeed created.

But then I would have to create a separate stage simply to split the input
files, so that is a bit cumbersome. It would be nice if there were some
control to directly limit the input size per map, etc.

Thanks
Yang

On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> By block size I mean the actual HDFS block size. Based on your requirement
> it seems like the input files are extremely small and reducing the block
> size is not an option.
>
> Specifying "mapred.min.split.size" would not work here for either Hadoop/Java
> MR or Pig: Hadoop effectively picks the maximum of (minSplitSize, blockSize),
> so lowering the minimum split size will not create more splits.
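>
> Roughly, the split sizing in Hadoop's FileInputFormat looks like the sketch
> below (the exact formula varies across Hadoop versions, so treat it as an
> illustration rather than the exact code):
>
>   splitSize = max(mapred.min.split.size, min(goalSize, blockSize))
>   goalSize  = total input size / requested number of map tasks
>
> In other words, lowering the minimum cannot make splits smaller than what the
> block size (or goal size) already dictates; it only raises the floor.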
>
> Your job is more CPU-intensive than I/O-bound. I can think of splitting your
> input into multiple files (equal to the # of map tasks on your cluster)
> and turning off split combination (pig.splitCombination=false), though this
> is generally a terrible MR practice!
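>
> A minimal sketch of what that script could look like (the input path is just
> a placeholder, and you may want to double-check the pig.splitCombination
> property name against your Pig version):
>
>   SET pig.splitCombination 'false';     -- one map per input file, no combining
>   raw = LOAD 'input_part_*';            -- the pre-split input files
>   processed = FOREACH raw GENERATE $0;  -- your per-record processing goes here
>   STORE processed INTO 'output';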
>
> Another thing you could try is giving your map tasks more memory by setting
> "mapred.child.java.opts" to a higher value.
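>
> For instance, from within the script (a sketch; the 1 GB heap below is an
> arbitrary value, and the property can also be set in mapred-site.xml or on
> the pig command line):
>
>   SET mapred.child.java.opts '-Xmx1024m';  -- bigger heap for each map/reduce task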
>
> Thanks,
> Prashant
>
>
> On Wed, Jan 11, 2012 at 6:27 PM, Yang <[EMAIL PROTECTED]> wrote:
>
> > Prashant:
> >
> > thanks.
> >
> > by "reducing the block size", do you mean split size ? ---- block size
> > is fixed on a hadoop hdfs.
> >
> > my application is not really data heavy, each line of input takes a
> > long while to process. as a result, the input size is small, but total
> > processing time is long, and the potential parallelism is high
> >
> > Yang
> >
> > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > <[EMAIL PROTECTED]> wrote:
> > > Hi Yang,
> > >
> > > You cannot really control the number of mappers directly (it depends on
> > > the input splits), but you can certainly spawn more mappers in various
> > > ways, such as reducing the block size or setting pig.splitCombination to
> > > false (this *might* create more maps).
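> > >
> > > As a rough sketch of the block-size route (the 16 MB value is arbitrary,
> > > and it only affects files written after the setting, so the input would
> > > have to be rewritten first):
> > >
> > >   SET dfs.block.size 16777216;           -- 16 MB blocks for files this job writes
> > >   raw = LOAD 'input.txt';
> > >   STORE raw INTO 'input_smaller_blocks'; -- the rewritten copy gets the smaller blocks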
> > >
> > > The level of parallelization depends on how much data the 2 mappers are
> > > handling. You would not want a lot of maps each handling too little data.
> > > For example, if your input data set is only a few MB, it would not be a
> > > good idea to have more than 1 or 2 maps.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > Sent from my iPhone
> > >
> > > On Jan 11, 2012, at 6:13 PM, Yang <[EMAIL PROTECTED]> wrote:
> > >
> > >> I have a Pig script that does basically a map-only job:
> > >>
> > >> raw = LOAD 'input.txt';
> > >>
> > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> > >>
> > >> STORE processed INTO 'output.txt';
> > >>
> > >>
> > >>
> > >> I have many nodes on my cluster, so I want Pig to process the input with
> > >> more mappers, but it generates only 2 part-m-xxxxx files, i.e. it is
> > >> using 2 mappers.
> > >>
> > >> In a Hadoop job it's possible to pass the mapper count and
> > >> -Dmapred.min.split.size= ; would this also work for Pig? The PARALLEL
> > >> keyword only works for reducers.
> > >>
> > >>
> > >> Thanks
> > >> Yang
> >
>