Hadoop user mailing list: guessing number of reducers


Re: guessing number of reducers.
Thanks for the input, guys. This helps a lot. :)

On Wednesday, November 21, 2012, Bejoy KS <[EMAIL PROTECTED]> wrote:
> Hi Andy
>
> It is usually so because if you have more reduce tasks than reduce slots in your cluster, a few of the reduce tasks will sit in the queue waiting for their turn. So it is better to keep the number of reduce tasks slightly less than the reduce task capacity, so that all reduce tasks run in parallel at once.
>
> But in some cases each reducer can process only a certain volume of data due to some constraints; for example, data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers based entirely on your data and not on the slots.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
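
A minimal sketch of how both rules above could be wired together against the Hadoop 1.x mapred API of that era; the chooseReducers helper, the 5% headroom factor, and the maxBytesPerReducer limit are illustrative assumptions, not values from the thread.

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {

    // Pick a reducer count slightly below the cluster's reduce slot capacity,
    // so all reduce tasks can run in a single wave, but never below the count
    // forced by a per-reducer data limit (e.g. to avoid reduce-side OOMs).
    public static int chooseReducers(JobConf conf,
                                     long estimatedMapOutputBytes,
                                     long maxBytesPerReducer) throws IOException {
        ClusterStatus status = new JobClient(conf).getClusterStatus();
        int reduceSlots = status.getMaxReduceTasks();

        // "Slightly less" than capacity; the 5% headroom is an assumption.
        int slotBased = Math.max(1, (int) (reduceSlots * 0.95));

        // Minimum number of reducers needed to keep each one under the data limit.
        long dataBased = (estimatedMapOutputBytes + maxBytesPerReducer - 1)
                         / maxBytesPerReducer;

        return (int) Math.max(slotBased, Math.max(1, dataBased));
    }
}

A driver would then pass the result to conf.setNumReduceTasks(...) before submitting the job.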
> ________________________________
> From: "Kartashov, Andy" <[EMAIL PROTECTED]>
> Date: Wed, 21 Nov 2012 17:49:50 +0000
> To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]>
> Subject: RE: guessing number of reducers.
>
> Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested:
>
>
>
> Output 25 GB. 8-DN cluster with a map and reduce task capacity of 16 each:
>
>
>
> 1 Reducer   – 22 mins
> 4 Reducers  – 11.5 mins
> 8 Reducers  – 5 mins
> 10 Reducers – 7 mins
> 12 Reducers – 6.5 mins
> 16 Reducers – 5.5 mins
>
>
>
> 8 Reducers have won the race, but Reducers at the max capacity were very close. :)
>
>
>
> AK47
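
A sweep like the one above can be re-run at different reducer counts without recompiling by letting ToolRunner/GenericOptionsParser pick up -D mapred.reduce.tasks=N from the command line. The skeleton below is only a sketch; the ReducerSweep name, the jar name in the usage comment, and the identity pass-through job are assumptions, not the job Andy actually ran.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Usage: hadoop jar sweep.jar ReducerSweep -D mapred.reduce.tasks=8 <in> <out>
public class ReducerSweep extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already applied any -D options to getConf(),
        // so mapred.reduce.tasks from the command line takes effect here.
        Job job = new Job(getConf(), "reducer-sweep");
        job.setJarByClass(ReducerSweep.class);

        // Identity mapper/reducer (the defaults): a plain pass-through job,
        // enough to compare wall-clock time at different reducer counts.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Alternatively, hard-code it: job.setNumReduceTasks(8);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ReducerSweep(), args));
    }
}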
>
>
>
>
>
> From: Bejoy KS [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, November 21, 2012 11:51 AM
> To: [EMAIL PROTECTED]
> Subject: Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data volume going to the reduce phase. In tools like Hive and Pig, by default there will be one reducer for every 1 GB of map output; so if you have 100 gigs of map output, then 100 reducers.
> If your tasks are more CPU intensive, then you need a smaller volume of data per reducer for better performance.
>
> In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
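
As a rough illustration of that one-reducer-per-gigabyte heuristic, loosely modelled on Hive's hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max settings (the constants and class name below are assumptions, not Hive's actual code):

public class ReducerEstimate {

    private static final long BYTES_PER_REDUCER = 1L << 30; // ~1 GB per reducer (assumed default)
    private static final int MAX_REDUCERS = 999;            // upper cap; assumed, similar to old Hive defaults

    // One reducer per BYTES_PER_REDUCER of map output, capped at MAX_REDUCERS.
    public static int estimateReducers(long mapOutputBytes) {
        long needed = (mapOutputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.max(1, Math.min(needed, MAX_REDUCERS));
    }

    public static void main(String[] args) {
        // 100 GB of map output -> 100 reducers, matching the example above.
        System.out.println(estimateReducers(100L * (1L << 30)));
    }
}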
>
> ________________________________
>
> From: jamal sasha <[EMAIL PROTECTED]>
>
> Date: Wed, 21 Nov 2012 11:38:38 -0500
>
> To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>