Pig, mail # user - Number of reduce tasks


Re: Number of reduce tasks
Dmitriy Ryaboy 2012-06-01, 23:49
That being said, some operators, such as "group all" and limit, by nature require only 1 reducer. So it depends on what your script is doing.
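For illustration, a minimal Pig Latin sketch of a script where both of those single-reducer operators show up (the paths and schema here are invented):

    -- hypothetical input path and schema, for illustration only
    logs = LOAD 'input/logs' AS (user:chararray, bytes:long);

    -- GROUP ... ALL collapses everything into a single group, so the
    -- aggregation below runs on one reducer regardless of PARALLEL
    all_logs = GROUP logs ALL;
    total    = FOREACH all_logs GENERATE COUNT(logs), SUM(logs.bytes);

    -- LIMIT likewise funnels through a single reducer to produce the final N rows
    top_rows = LIMIT logs 10;

    STORE total INTO 'output/total';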

On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> The automatic heuristic works the same in 0.9.1
> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
> better off setting it manually by looking at JobTracker counters.
>
> You should be fine using PARALLEL for any of the operators mentioned
> in the doc.
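For illustration, a small sketch of PARALLEL on one such operator (the relation names, paths, and reducer count are invented):

    -- hypothetical relations; PARALLEL applies to the reduce phase of this GROUP
    users   = LOAD 'input/users' AS (id:long, country:chararray);
    grouped = GROUP users BY country PARALLEL 50;
    counts  = FOREACH grouped GENERATE group, COUNT(users);
    STORE counts INTO 'output/counts_by_country';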
>
> -Prashant
>
>
> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
>
>> Hi Prashant,
>>
>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems like
>> a very useful upgrade. For the moment, though, it seems that I should be
>> able to use the 1 GB per reducer heuristic and specify the number of
>> reducers in Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does
>> this sound right?
>>
>> Thanks,
>> Pankaj
>>
>>
>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>
>>> Also, please note that the default number of reducers is based on the
>>> input dataset size. In the basic case, Pig will "automatically" spawn a
>>> reducer for each GB of input, so if your input dataset size is 500 GB you
>>> should see 500 reducers being spawned (though this is excessive in a lot
>>> of cases).
>>>
>>> This document talks about parallelism
>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>
>>> Setting the right number of reducers (PARALLEL or set default_parallel)
>>> depends on what you are doing. If the reduce side is CPU intensive (maybe
>>> a complex UDF running there), you would probably spawn more reducers.
>>> Otherwise (in most cases), the suggestion in the doc (1 GB per reducer)
>>> holds well for regular aggregations (SUM, COUNT, ...).
>>>
>>>
>>>  1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>>  2. Re-run the job with default_parallel set to 1 reducer per 1 GB of
>>>     reduce shuffle bytes and see if it performs well (see the sketch
>>>     below).
>>>  3. If not, adjust it according to your reducer heap size. The more
>>>     heap, the less data is spilled to disk.
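As a hypothetical instance of step 2: if the JobTracker counters showed roughly 50 GB of reduce shuffle bytes, that rule of thumb suggests starting around 50 reducers (relation names and paths below are invented):

    -- Hypothetical numbers: ~50 GB of reduce shuffle bytes observed on the
    -- JobTracker, so start with roughly 50 reducers and adjust from there.
    SET default_parallel 50;

    clicks = LOAD 'input/clicks' AS (user:chararray, url:chararray);
    g      = GROUP clicks BY url;        -- reduce phase now uses 50 reducers
    hits   = FOREACH g GENERATE group, COUNT(clicks);
    STORE hits INTO 'output/hits_per_url';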
>>>
>>> There are a few more properties on the reduce side (buffer sizes, etc.),
>>> but those are probably not required to start with.
>>>
>>> Thanks,
>>>
>>> Prashant
>>>
>>>
>>>
>>>
>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Pankaj,
>>>>
>>>> What version of Pig are you using? Later versions of Pig have some
>>>> logic for automatically setting parallelism (though sometimes these
>>>> heuristics will be wrong).
>>>>
>>>> There are also some operations which will force you to use 1 reducer. It
>>>> depends on what your script is doing.
>>>>
>>>> 2012/6/1 Pankaj Gupta <[EMAIL PROTECTED]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I just realized that one of my large-scale Pig jobs, which has 100K map
>>>>> tasks, actually has only one reduce task. Reading the documentation, I
>>>>> see that the number of reduce tasks is defined by the PARALLEL clause,
>>>>> whose default value is 1. I have a few questions around this:
>>>>>
>>>>> # Why is the default value of reduce tasks 1?
>>>>> # (Related to first question) Why aren't reduce tasks parallelized
>>>>> automatically in Pig?
>>>>> # How do I choose a good value of reduce tasks for my pig jobs?
>>>>>
>>>>> Thanks in Advance,
>>>>> Pankaj
>>>>
>>
>>