Re: Number of reduce tasks
Right. And the documentation provides a list of operations that can be
parallelized.

On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> That being said, some operators, such as "group all" and limit, by their nature require only 1 reducer. So it depends on what your script is doing.
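
A minimal Pig Latin sketch of the kind of single-reducer operators the reply refers to (relation and path names here are hypothetical, not from this thread):

    logs       = LOAD '/logs/2012-06-01' AS (user:chararray, bytes:long);
    everything = GROUP logs ALL;        -- one group key, so this stage runs on a single reducer
    total      = FOREACH everything GENERATE SUM(logs.bytes);
    first_rows = LIMIT logs 10;         -- a final LIMIT is likewise funneled through 1 reducer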
>
> On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
>> The automatic heuristic works the same way in 0.9.1
>> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
>> better off setting it manually after looking at the JobTracker counters.
>>
>> You should be fine using PARALLEL with any of the operators mentioned
>> in the doc.
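
A hedged sketch of the PARALLEL clause on a couple of those operators (relation names and reducer counts are made up for illustration):

    clicks  = LOAD '/data/clicks' AS (user:chararray, url:chararray);
    by_user = GROUP clicks BY user PARALLEL 20;        -- 20 reduce tasks for this GROUP
    counts  = FOREACH by_user GENERATE group AS user, COUNT(clicks) AS n;
    ranked  = ORDER counts BY n DESC PARALLEL 10;      -- ORDER BY also accepts PARALLEL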
>>
>> -Prashant
>>
>>
>> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Prashant,
>>>
>>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems like a
>>> very useful upgrade. For the moment, though, it seems that I should be able
>>> to use the 1 GB per reducer heuristic and specify the number of reducers in
>>> Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this sound
>>> right?
>>>
>>> Thanks,
>>> Pankaj
>>>
>>>
>>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>>
>>>> Also, please note the default number of reducers is based on the input dataset.
>>>> In the basic case, Pig will "automatically" spawn a reducer for each GB of
>>>> input, so if your input dataset size is 500 GB you should see 500 reducers
>>>> being spawned (though this is excessive in a lot of cases).
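
As a sketch, this estimate is governed by a couple of Pig properties; the names come from the Pig performance docs (linked just below), but verify the defaults for your version:

    SET pig.exec.reducers.bytes.per.reducer 1000000000;  -- roughly 1 GB of input per reducer
    SET pig.exec.reducers.max 999;                        -- upper bound on the estimated reducer count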
>>>>
>>>> This document talks about parallelism
>>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>>
>>>> Setting the right number of reducers (PARALLEL or set default_parallel)
>>>> depends on what you are doing with it. If the reducer is CPU intensive
>>>> (maybe a complex UDF running on the reducer side), you would probably spawn
>>>> more reducers. Otherwise (in most cases), the suggestion in the doc (1 GB per
>>>> reducer) works well for regular aggregations (SUM, COUNT, ...).
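
A minimal sketch of both ways of setting this (relation names and reducer counts are illustrative only):

    SET default_parallel 100;                      -- script-wide default for reduce stages
    sales    = LOAD '/data/sales' AS (store:chararray, amount:double);
    by_store = GROUP sales BY store;               -- picks up default_parallel (100 reducers)
    totals   = FOREACH by_store GENERATE group, SUM(sales.amount) AS total, COUNT(sales) AS n;
    -- a single heavy operator can still override the default, e.g.
    -- heavy = GROUP sales BY store PARALLEL 400;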
>>>>
>>>>
>>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker
>>>> 2. Re-run the job with default_parallel set to 1 reducer per 1 GB
>>>> of reduce shuffle bytes and see if it performs well
>>>> 3. If not, adjust it according to your reducer heap size. The more
>>>> heap, the less data is spilled to disk.
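
A sketch of step 2 with made-up numbers: if the JobTracker counter shows about 40 GB of Reduce Shuffle Bytes, a starting point would be

    -- ~40 GB reduce shuffle bytes / ~1 GB per reducer => start around 40 reducers
    SET default_parallel 40;
    -- if the reducers still spill heavily, raise this (or the reducer heap) and re-check the counters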
>>>>
>>>> There are a few more properties on the reduce side (buffer size, etc.), but
>>>> those probably aren't required to start with.
>>>>
>>>> Thanks,
>>>>
>>>> Prashant
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Pankaj,
>>>>>
>>>>> What version of Pig are you using? In later versions of Pig, it should have
>>>>> some logic around automatically setting parallelisms (though sometimes
>>>>> these heuristics will be wrong).
>>>>>
>>>>> There are also some operations which will force you to use 1 reducer. It
>>>>> depends on what your script is doing.
>>>>>
>>>>> 2012/6/1 Pankaj Gupta <[EMAIL PROTECTED]>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I just realized that one of my large-scale Pig jobs, which has 100K map
>>>>>> tasks, actually has only one reduce task. Reading the documentation I see
>>>>>> that the number of reduce tasks is defined by the PARALLEL clause, whose
>>>>>> default value is 1. I have a few questions around this:
>>>>>>
>>>>>> # Why is the default value of reduce tasks 1?
>>>>>> # (Related to first question) Why aren't reduce tasks parallelized
>>>>>> automatically in Pig?
>>>>>> # How do I choose a good value of reduce tasks for my pig jobs?
>>>>>>
>>>>>> Thanks in Advance,
>>>>>> Pankaj
>>>>>
>>>
>>>