Re: How can I set the mapper number for pig script?
Hi Sheng,

I had exactly the same problem as you did.

Right now, with Hadoop 0.20 and above, you can't do it anymore, because the
new mapreduce.lib.input.FileInputFormat no longer uses mapred.map.tasks to
compute the goalSize in its getSplits() method. The old
mapred.FileInputFormat class had this control.
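
To make the difference concrete, here is a rough sketch of the split-size arithmetic in the two APIs as I understand it. It is not the Hadoop source: the class and method names are made up for illustration, and the 1 GB input and 64 MB block size are assumed values.

```java
// Illustrative sketch only; SplitSizeSketch and its methods are hypothetical names.
final class SplitSizeSketch {

    // Old API (mapred.FileInputFormat): the requested number of map tasks
    // (mapred.map.tasks) feeds into goalSize, so it influences the split size.
    static long oldApiSplitSize(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // New API (mapreduce.lib.input.FileInputFormat): there is no goalSize;
    // only the min/max split size settings and the block size matter.
    static long newApiSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 1024 * mb;   // assumed 1 GB of input
        long block = 64 * mb;     // assumed 64 MB block size

        // Old API: asking for 64 maps shrinks the split size below the block size.
        System.out.println(oldApiSplitSize(total, 64, 1, block) / mb + " MB");

        // New API with no max split size configured: the split is one block.
        System.out.println(newApiSplitSize(1, Long.MAX_VALUE, block) / mb + " MB");
    }
}
```

With these assumed numbers the old formula yields 16 MB splits (about 64 mappers), while the new one stays at 64 MB (about 16 mappers) no matter what mapred.map.tasks says.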

I submitted https://issues.apache.org/jira/browse/HADOOP-8503 to add this
control back.

Because Pig actually compiles some Hadoop classes into its own jar,
including this FileInputFormat class, you could work around the problem
by patching your own Hadoop jar, building Pig against that jar, and then
using your rebuilt Pig in production. You need to make sure to use the full
Pig jar instead of pig-withouthadoop.jar.

You can also achieve part of the same goal by setting
mapred.max.split.size, but this is rather inflexible: if your Pig
script generates several MR jobs, the same split size applies to all of the
jobs, which may not be ideal if one stage consumes a lot more input data
than another.
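
As a back-of-the-envelope illustration of that inflexibility, the sketch below estimates the mapper count for two stages that share one maximum split size. The class name, the stage sizes, and the 16 MB setting are all hypothetical, and the estimate ignores Pig's split combination and Hadoop's per-file handling of the last split.

```java
// Illustrative sketch only; MapperCountSketch and the sizes below are hypothetical.
final class MapperCountSketch {

    // Approximate mapper count: input size divided by the effective split size,
    // rounded up. Assumes a single large input and no minimum split size.
    static long estimateMappers(long inputBytes, long maxSplitBytes, long blockBytes) {
        long splitSize = Math.max(1L, Math.min(maxSplitBytes, blockBytes));
        return (inputBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long maxSplit = 16 * mb;  // one script-wide max split size
        long block = 64 * mb;     // assumed block size

        long heavyStageInput = 10L * 1024 * mb; // assumed 10 GB first stage
        long lightStageInput = 64 * mb;         // assumed 64 MB later stage

        System.out.println("heavy stage: " + estimateMappers(heavyStageInput, maxSplit, block));
        System.out.println("light stage: " + estimateMappers(lightStageInput, maxSplit, block));
    }
}
```

Under these assumptions the same setting gives roughly 640 mappers for the heavy stage and 4 for the light one; a value tuned for one stage can easily be a poor fit for the other.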

On Sat, Jun 23, 2012 at 1:48 PM, Sheng Guo <[EMAIL PROTECTED]> wrote:

> Thanks for all your help.
> My Pig script may have some CPU-intensive jobs like NLP processing, so it
> would be helpful if I had multiple mappers running. Correct me if I am
> wrong.
> Thanks,
> Sheng
> On Sat, Jun 23, 2012 at 9:40 AM, Scott Foster <[EMAIL PROTECTED]> wrote:
> > You can also turn off split combination completely, and then the number
> > of mappers will equal the number of blocks:
> > SET pig.noSplitCombination true;
> >
> > Adding mappers may not make your process run faster since the time to
> > read the data may be less than the overhead of creating a new JVM for
> > each map task.
> >
> > scott.
> >