Pig user mailing list: Number of reduce tasks


Re: Number of reduce tasks
The automatic heuristic works the same way in 0.9.1
(http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
better off setting it manually by looking at the JobTracker counters.

You should be fine using PARALLEL with any of the operators mentioned
in the doc.
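For example, a minimal sketch (the file paths, relation, and field names
here are made up for illustration):

  logs = LOAD 'input/logs' AS (user_id:chararray, bytes:long);
  grpd = GROUP logs BY user_id PARALLEL 50;  -- 50 reduce tasks for this GROUP
  sums = FOREACH grpd GENERATE group, SUM(logs.bytes);
  STORE sums INTO 'output/sums';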

-Prashant
On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi Prashant,
>
> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
> like a very useful upgrade. For the moment, though, it seems that I
> should be able to use the 1 GB per reducer heuristic and specify the
> number of reducers in Pig 0.9.1 by using the PARALLEL clause in the Pig
> script. Does this sound right?
>
> Thanks,
> Pankaj
>
>
> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>
> > Also, please note that the default number of reducers is based on the
> > input dataset. In the basic case, Pig will "automatically" spawn a
> > reducer for each GB of input, so if your input dataset size is 500 GB
> > you should see 500 reducers being spawned (though this is excessive in
> > a lot of cases).
> >
> > This document talks about parallelism
> > http://pig.apache.org/docs/r0.10.0/perf.html#parallel
> >
> > Setting the right number of reducers (PARALLEL or set default_parallel)
> > depends on what you are doing with it. If the reducer is CPU intensive
> > (maybe a complex UDF running on the reducer side), you would probably
> > spawn more reducers. Otherwise (in most cases), the suggestion in the
> > doc (1 GB per reducer) holds for regular aggregations (SUM, COUNT, ...).
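> >
> > As a rough sketch (the relation names and the UDF myudfs.Expensive are
> > made up for illustration; default_parallel applies script-wide, while
> > PARALLEL overrides it per operator):
> >
> >   set default_parallel 100;  -- script-wide default for reduce stages
> >   events = LOAD 'events' AS (id:chararray, val:long);
> >   agg = FOREACH (GROUP events BY id) GENERATE group, SUM(events.val);
> >   -- a CPU-heavy stage can override the default with more reducers:
> >   heavy = FOREACH (GROUP events BY id PARALLEL 400)
> >           GENERATE group, myudfs.Expensive(events);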
> >
> >
> >   1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
> >   2. Re-run the job, setting default_parallel to 1 reducer per 1 GB of
> >   reduce shuffle bytes, and see if it performs well.
> >   3. If not, adjust it according to your reducer heap size. The more
> >   heap, the less data is spilled to disk.
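> >
> > For example, if Reduce Shuffle Bytes shows roughly 200 GB, a starting
> > point (the number here is purely illustrative) would be:
> >
> >   set default_parallel 200;  -- roughly 1 reducer per GB of shuffle data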
> >
> > There are a few more properties on the reduce side (buffer sizes etc.),
> > but those are probably not required to start with.
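> >
> > If you do get there, note these are standard Hadoop shuffle properties
> > rather than Pig settings (shown only as a sketch):
> >
> >   set mapred.job.shuffle.input.buffer.percent 0.70;  -- heap fraction for buffering map output
> >   set mapred.job.shuffle.merge.percent 0.66;         -- fill level that triggers a merge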
> >
> > Thanks,
> >
> > Prashant
> >
> >
> >
> >
> > On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[EMAIL PROTECTED]
> >wrote:
> >
> >> Pankaj,
> >>
> >> What version of Pig are you using? Later versions of Pig should have
> >> some logic for automatically setting parallelism (though sometimes
> >> these heuristics will be wrong).
> >>
> >> There are also some operations which will force you to use 1 reducer. It
> >> depends on what your script is doing.
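> >>
> >> For instance, a global aggregation forces a single reducer no matter
> >> what PARALLEL says (a minimal sketch; the relation name is made up):
> >>
> >>   cnt = FOREACH (GROUP logs ALL) GENERATE COUNT(logs);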
> >>
> >> 2012/6/1 Pankaj Gupta <[EMAIL PROTECTED]>
> >>
> >>> Hi,
> >>>
> >>> I just realized that one of my large-scale Pig jobs, which has 100K
> >>> map tasks, actually has only one reduce task. Reading the
> >>> documentation, I see that the number of reduce tasks is defined by the
> >>> PARALLEL clause, whose default value is 1. I have a few questions
> >>> around this:
> >>>
> >>> # Why is the default value of reduce tasks 1?
> >>> # (Related to the first question) Why aren't reduce tasks parallelized
> >>> automatically in Pig?
> >>> # How do I choose a good number of reduce tasks for my Pig jobs?
> >>>
> >>> Thanks in advance,
> >>> Pankaj
> >>
>
>