MapReduce user mailing list: Idle nodes with terasort and MRv2/YARN (0.23.1)


Re: Idle nodes with terasort and MRv2/YARN (0.23.1)
Trevor,

On May 30, 2012, at 1:38 PM, Trevor Robinson wrote:

> Jeff:
>
> Thanks for the corroboration and advice. I can't retreat to 0.20, and
> must forge ahead with 2.0, so I'll share any progress.
>
> Arun:
>
> I haven't set the minimum container size. Do you know the default? Is
> there a way to easily find defaults (more complete/reliable than
> docs)?

The default is way too low (128M), which is non-optimal.

I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4316 to fix it.
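
For reference, overriding it in yarn-site.xml looks roughly like this (1024 here is just an illustrative value; see below):

<!-- yarn-site.xml: raise the minimum container allocation (example value) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>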

>
> Thanks, I'll give that a try. How does minimum container size relate
> to settings mapreduce.map.memory.mb? Would it essentially raise my
> 768M map memory allocation to 1G? Does CapacityScheduler need any
> additional configuration to function optimally? BTW, what is the
> default scheduler?

The min. container size should be >= mapreduce.map.memory.mb.

If you have a heap size of 768M for your maps, it should be perfectly ok to have min-container size as 1024M.

Hmm... I'd also look into bumping mapreduce.reduce.memory.mb to be 2*min-container-size, i.e. the equivalent of 2 containers.
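
Concretely, with a 1024M minimum container, the per-task settings would look roughly like this in mapred-site.xml (illustrative values only):

<!-- mapred-site.xml: per-task container requests and heap (example values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>  <!-- matches the 1024M minimum container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx768m</value>  <!-- your current 768M heap fits comfortably inside it -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>  <!-- 2 * min-container-size, per the above -->
</property>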

The default scheduler is the FifoScheduler, which isn't as mature as the CapacityScheduler (that's what we use for tuning etc.; see http://hortonworks.com/blog/delivering-on-hadoop-next-benchmarking-performance/ for more details).
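
Switching is just a matter of pointing the RM at the scheduler class in yarn-site.xml, roughly:

<!-- yarn-site.xml: use the CapacityScheduler instead of the default FifoScheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The capacity-scheduler.xml that ships with the distribution (a single 'default' queue at 100% capacity) should be fine for benchmarking.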

>
> I should have MAPREDUCE-3641. I'm using 0.23.1 with CDH4b2 patches
> (and a few Java 7/Ubuntu 12.04 build patches). How does 2.0.0-alpha
> compare to 0.23.1?
>

I'm not familiar with patchsets on CDH, but hadoop-2.0.0-alpha is a significant improvement on hadoop-0.23.1...

> If there's anything I can do to assist with the issue of spreading out
> map tasks, please let me know. Is there a JIRA issue for it (or if
> not, should there be)?
>

For smaller clusters https://issues.apache.org/jira/browse/MAPREDUCE-3210 could help.

> Incidentally, my current benchmarking work on x86 is only a training
> ground and baseline before moving onto ARM-based systems, which have
> 4GB RAM and generally fewer, smaller (2.5" form factor) disks per
> node. It sounds like the smaller RAM will force better distribution,
> but the disk capacity/utilization situation will be more severe.
>

Right, smaller RAM should force better distribution.

Would love to get more feedback from your ARM work, thanks!

Arun

> Thanks,
> Trevor
>
> On Tue, May 29, 2012 at 6:21 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>> What is the minimum container size? i.e.
>> yarn.scheduler.minimum-allocation-mb.
>>
>> I'd bump it up to at least 1G and use the CapacityScheduler for performance
>> tests:
>> http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>
>> In case of teragen, the job has no locality at all (since it's just
>> generating data from 'random' input-splits) and hence you are getting them
>> stuck on fewer nodes since you have so many containers on each node.
>>
>> The reduces should be better spread if you are using CapacityScheduler and
>> have https://issues.apache.org/jira/browse/MAPREDUCE-3641 in your build i.e.
>> hadoop-0.23.1 or hadoop-2.0.0-alpha (I'd use the latter).
>>
>> Also, FYI, currently the CS makes the tradeoff that node-locality is almost
>> the same as rack-locality, and hence you might see maps not spread out for
>> terasort. I'll fix that one soon.
>>
>> hth,
>> Arun
>>
>> On May 29, 2012, at 2:33 PM, Trevor Robinson wrote:
>>
>> Hello,
>>
>> I'm trying to tune terasort on a small cluster (4 identical slave
>> nodes w/ 4 disks and 16GB RAM each), but I'm having problems with very
>> uneven load.
>>
>> For teragen, I specify 24 mappers, but for some reason, only 2 nodes
>> out of 4 run them all, even though the web UI (for both YARN and HDFS)
>> shows all 4 nodes available. Similarly, I specify 16 reducers for
>> terasort, but the reducers seem to run on 3 nodes out of 4. Do I have
>> something configured wrong, or does the scheduler not attempt to
>> spread out the load? In addition to performing sub-optimally, this

Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/