Pig >> mail # user >> Number of reduce tasks

Re: Number of reduce tasks
Also, please note that the default number of reducers is based on the size
of the input dataset. In the basic case, Pig will "automatically" spawn one
reducer for each GB of input, so if your input dataset is 500 GB you should
see 500 reducers being spawned (though this is excessive in a lot of cases).
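
To make the arithmetic concrete, here is a rough Python sketch of that rule of thumb. It is not Pig's actual code: the function name, the choice of 1 GB = 10^9 bytes, and the optional max_reducers cap are assumptions added for illustration.

```python
GB = 10**9  # assumption: 1 GB taken as 10^9 bytes


def estimate_reducers(input_bytes, bytes_per_reducer=GB, max_reducers=None):
    """Estimate a reducer count as one reducer per `bytes_per_reducer` of input.

    `max_reducers` is a hypothetical optional upper bound; None means no cap.
    """
    # Integer ceiling division, so a partial last chunk still gets a reducer.
    reducers = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    if max_reducers is not None:
        reducers = min(reducers, max_reducers)
    # Always ask for at least one reducer.
    return max(reducers, 1)


print(estimate_reducers(500 * GB))  # 500 GB of input at 1 GB per reducer
```

With a 500 GB input this gives 500, which matches the figure above; passing a cap such as max_reducers=100 would limit it to 100.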

This document talks about parallelism

Setting the right number of reducers (PARALLEL or set default_parallel)
depends on what you are doing with them. If the reducer is CPU intensive
(maybe a complex UDF running on the reducer side), you would probably spawn
more reducers. Otherwise (in most cases), the suggestion in the doc (1 GB per
reducer) holds for regular aggregations (SUM, COUNT, ...).
   1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker
   2. Re-run the job with default_parallel set to 1 reducer per 1 GB
   of reduce shuffle bytes and see if it performs well
   3. If not, adjust it according to your reducer heap size. The larger the
   heap, the less data is spilled to disk.
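
The steps above can be sketched as a small Python helper that turns a Reduce Shuffle Bytes reading into a "set default_parallel" line for the script. The helper name and the gb_per_reducer knob are hypothetical; the knob stands in for the heap-based adjustment in step 3, and 1 GB is assumed to be 10^9 bytes.

```python
def default_parallel_statement(reduce_shuffle_bytes, gb_per_reducer=1.0):
    """Build a Pig `set default_parallel N;` line from reduce shuffle bytes.

    `gb_per_reducer` is a hypothetical tuning knob: raise it when reducers
    have a larger heap, lower it when they spill too much to disk.
    """
    bytes_per_reducer = int(gb_per_reducer * 10**9)  # assumption: 1 GB = 10^9 bytes
    # Ceiling division, with a floor of one reducer.
    reducers = max(1, -(-reduce_shuffle_bytes // bytes_per_reducer))
    return f"set default_parallel {reducers};"


# Example: the JobTracker reports 42 GB of reduce shuffle bytes.
print(default_parallel_statement(42 * 10**9))
```

The printed line goes at the top of the Pig script before re-running the job; a PARALLEL clause on an individual operator can be used instead if only one step needs the change.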

There are a few more properties on the reduce side (buffer size etc.), but
those are probably not required to start with.


On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Pankaj,
> What version of Pig are you using? Later versions of Pig should have
> some logic for setting parallelism automatically (though sometimes
> these heuristics will be wrong).
> There are also some operations which will force you to use 1 reducer. It
> depends on what your script is doing.
> 2012/6/1 Pankaj Gupta <[EMAIL PROTECTED]>
> > Hi,
> >
> > I just realized that one of my large scale pig jobs that has 100K map jobs
> > actually only has one reduce task. Reading the documentation I see that the
> > number of reduce tasks is defined by the PARALLEL clause whose default
> > value is 1. I have a few questions around this:
> >
> > # Why is the default value of reduce tasks 1?
> > # (Related to first question) Why aren't reduce tasks parallelized
> > automatically in Pig?
> > # How do I choose a good value of reduce tasks for my pig jobs?
> >
> > Thanks in Advance,
> > Pankaj