MapReduce user mailing list: max 1 mapper per node


Messages in this thread:
Radim Kolar 2012-04-26, 17:56
Robert Evans 2012-04-26, 18:07
Radim Kolar 2012-04-27, 05:38
Robert Evans 2012-04-27, 15:30
Radim Kolar 2012-05-03, 09:19
Radim Kolar 2012-05-03, 12:59
Arun C Murthy 2012-05-09, 18:33
Jeffrey Buell 2012-05-09, 19:36
Arun C Murthy 2012-05-10, 20:26
RE: max 1 mapper per node
I have the right #slots to fill up memory across the cluster, and all those slots are filled with tasks. The problem I ran into was that the maps grabbed all the slots initially and the reduces had a hard time getting started.  As maps finished, more maps were started and only rarely was a reduce started.  I assume this behavior occurred because I had ~4000 map tasks in the queue, but only ~100 reduce tasks.  If the scheduler lumps maps and reduces together, then whenever a slot opens up it will almost surely be taken by a map task.  To get good performance I need all reduce tasks started early on, and have only map tasks compete for open slots.  Other apps may need different priorities between maps and reduces.  In any case, I don't understand how treating maps and reduces the same is workable.

Jeff
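
One knob that bears directly on the behavior described here is reduce "slowstart", the fraction of maps that must complete before the AM starts requesting reduce containers. A minimal sketch, assuming an MRv2-era job driver (the property name differs on 1.x, and the value is only illustrative):

    // Sketch only: request reduce containers as early as possible so reduces
    // are not starved while thousands of maps are still queued.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class EarlyReduceStart {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MRv2 name; Hadoop 1.x uses "mapred.reduce.slowstart.completed.maps".
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.0f);
        Job job = Job.getInstance(conf, "terasort-like job");
        // ... mapper, reducer, input/output paths as usual ...
      }
    }

Note this only changes when reduces are requested; it does not stop maps from winning the competition for whatever capacity frees up, which is the scheduling problem described above.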

From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 10, 2012 1:27 PM
To: [EMAIL PROTECTED]
Subject: Re: max 1 mapper per node

For terasort you want to fill up your entire cluster with maps/reduces as fast as you can to get the best performance.

Just play with #slots.

Arun
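
For reference, on 0.20/1.x "#slots" is the fixed per-TaskTracker task capacity, set per node in mapred-site.xml via two properties (example values only, tuned to the hardware):

    mapred.tasktracker.map.tasks.maximum     (e.g. 8 map slots per node)
    mapred.tasktracker.reduce.tasks.maximum  (e.g. 4 reduce slots per node)

On MRv2 there are no fixed slots; per-node concurrency falls out of container memory sizing, as sketched after Jeff's earlier message below.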

On May 9, 2012, at 12:36 PM, Jeffrey Buell wrote:
Not to speak for Radim, but what I'm trying to achieve is performance at least as good as 0.20 for all cases.  That is, no regressions.  For something as simple as terasort, I don't think that is possible without being able to specify the max number of mappers/reducers per node.  As it is, I see slowdowns of as much as 2X.  Hopefully I'm wrong and somebody will straighten me out.  But if I'm not, adding such a feature won't lead to bad behavior of any kind, since the default could be set to unlimited and thus have no effect whatsoever.

I should emphasize that I support the goal of greater automation since Hadoop has way too many parameters and is so hard to tune.  Just not at the expense of performance regressions.

Jeff
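
On MRv2 there is in fact no direct "max mappers per node" setting; how many tasks a node runs concurrently falls out of memory sizing. A rough worked example with made-up numbers:

    yarn.nodemanager.resource.memory-mb = 24576   (24 GB per node for containers)
    mapreduce.map.memory.mb             = 2048    (2 GB per map container)
    mapreduce.reduce.memory.mb          = 4096    (4 GB per reduce container)

    => at most 24576 / 2048 = 12 maps, or 24576 / 4096 = 6 reduces,
       can run on one node at a time (ignoring the AM's own container).

Raising per-task memory is therefore one indirect way to bound tasks per node, though it is a blunt substitute for the explicit limit being requested here.
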
We've been against these 'features' since they lead to very bad behaviour across the cluster with multiple apps/users, etc.

What is your use-case, i.e., what are you trying to achieve with this?

thanks,
Arun

On May 3, 2012, at 5:59 AM, Radim Kolar wrote:

If a plugin system for the AM is overkill, something simpler could be added instead, for example:

maximum number of mappers per node
maximum number of reducers per node

maximum percentage of non data local tasks
maximum percentage of rack local tasks

and set these in job properties.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
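
The per-node and locality limits Radim proposes above do not exist in stock MapReduce; the sketch below uses invented property names purely to make the proposal concrete (none of them are recognized by Hadoop):

    // Hypothetical sketch only: these property names illustrate Radim's proposal
    // and are not real Hadoop settings; a scheduler/AM change would be needed
    // for anything like them to take effect.
    import org.apache.hadoop.conf.Configuration;

    public class ProposedPerNodeLimits {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.job.max.maps.per.node", 1);           // hypothetical
        conf.setInt("mapreduce.job.max.reduces.per.node", 1);        // hypothetical
        conf.setInt("mapreduce.job.max.nonlocal.task.percent", 10);  // hypothetical
        conf.setInt("mapreduce.job.max.racklocal.task.percent", 30); // hypothetical
      }
    }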

More messages in this thread:
Radim Kolar 2012-05-10, 11:56
GUOJUN Zhu 2012-05-10, 12:53
Robert Evans 2012-05-10, 13:29
Radim Kolar 2012-05-14, 15:10
Robert Evans 2012-05-09, 16:10