Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
Jane,
       At first glance, these settings could help:
- Increase io.sort.factor to 100.
- Increase io.sort.mb to 512 MB.
- Increase the map task heap size to 2 GB.
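
As a rough sketch, the three settings above could be written out as job properties. The property names are the Hadoop 0.20-era ones as I understand them (mapred.child.java.opts applies to both map and reduce tasks in that version), and the class and method names are made up for illustration. The check that io.sort.mb stays below the task heap reflects the fact that the sort buffer is allocated inside the task JVM. The 1700 MB node RAM figure is an assumption about c1.medium, not something stated in this thread; if it is about right, three task slots with 2 GB heaps each would not fit, so the slot count or the heap size would need to come down.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapSideTuning {
    // Collects the suggested settings under Hadoop 0.20-style property names.
    // The sort buffer lives inside the task heap, so it must be smaller than the heap.
    static Map<String, String> suggestedSettings(int sortFactor, int sortMb, int heapMb) {
        if (sortMb >= heapMb) {
            throw new IllegalArgumentException(
                "io.sort.mb (" + sortMb + " MB) must be smaller than the task heap (" + heapMb + " MB)");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("io.sort.factor", String.valueOf(sortFactor));
        settings.put("io.sort.mb", String.valueOf(sortMb));
        settings.put("mapred.child.java.opts", "-Xmx" + heapMb + "m");
        return settings;
    }

    // Renders the settings as "-D key=value" generic options for the job command line.
    static String toGenericOptions(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Worst case: every task slot on the node runs with a full heap at the same time.
    static boolean fitsOnNode(int heapMb, int taskSlots, int nodeRamMb) {
        return (long) heapMb * taskSlots <= nodeRamMb;
    }

    public static void main(String[] args) {
        Map<String, String> settings = suggestedSettings(100, 512, 2048);
        System.out.println(toGenericOptions(settings));
        // 2 mappers + 1 reducer per node comes from the original question.
        // 1700 MB of node RAM is an assumption about c1.medium, not from this thread.
        System.out.println("fits on node: " + fitsOnNode(2048, 3, 1700));
    }
}
```

Whether these properties can be overridden per job depends on how the cluster is configured, so treat the generated options as a starting point.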

If the task still stalls, try giving each mapper less input.
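
To see why the sort buffer size, the merge factor, and the input per mapper all matter, here is a back-of-the-envelope estimate of how often a map task spills. It is a sketch under assumptions: Hadoop 0.20 defaults of io.sort.mb=100, io.sort.record.percent=0.05, io.sort.spill.percent=0.80 and io.sort.factor=10, with 16 bytes of accounting data per record. The average record size of about 69 bytes comes from the 9 GB / 130,000,000 records quoted below; the 10,000,000 records per mapper is a made-up figure, since the thread does not say how many mappers there are. The class and method names are hypothetical.

```java
class SpillEstimate {
    // Assumed per-record bookkeeping cost in the Hadoop 0.20 map output buffer.
    static final int ACCOUNTING_BYTES_PER_RECORD = 16;

    // Records collected before a spill starts: whichever fills first,
    // the accounting area or the data area, each at the spill threshold.
    static long recordsPerSpill(int sortMb, double recordPercent,
                                double spillPercent, double avgRecordBytes) {
        long bufferBytes = (long) sortMb * 1024 * 1024;
        long accountingBytes = (long) (bufferBytes * recordPercent);
        long dataBytes = bufferBytes - accountingBytes;
        long byAccounting = (long) (spillPercent * (accountingBytes / ACCOUNTING_BYTES_PER_RECORD));
        long byData = (long) (spillPercent * dataBytes / avgRecordBytes);
        return Math.min(byAccounting, byData);
    }

    // Number of spill files a single map task would write (rounded up).
    static long spillCount(long mapOutputRecords, long recordsPerSpill) {
        return (mapOutputRecords + recordsPerSpill - 1) / recordsPerSpill;
    }

    // With at most io.sort.factor spill files, the final merge needs only one pass.
    static boolean mergesInOnePass(long spills, int sortFactor) {
        return spills <= sortFactor;
    }

    public static void main(String[] args) {
        double avgRecordBytes = 9e9 / 130e6;     // from the figures in the question
        long recordsPerMapper = 10_000_000L;      // hypothetical, not from the thread

        long perSpillDefault = recordsPerSpill(100, 0.05, 0.80, avgRecordBytes);
        long perSpillTuned = recordsPerSpill(512, 0.05, 0.80, avgRecordBytes);
        long spillsDefault = spillCount(recordsPerMapper, perSpillDefault);
        long spillsTuned = spillCount(recordsPerMapper, perSpillTuned);

        System.out.println("default: " + spillsDefault + " spills, one-pass merge at factor 10: "
                + mergesInOnePass(spillsDefault, 10));
        System.out.println("tuned:   " + spillsTuned + " spills, one-pass merge at factor 100: "
                + mergesInOnePass(spillsTuned, 100));
    }
}
```

With records this small, the accounting area is what fills first in this model, so raising io.sort.record.percent may also be worth trying. Fewer records per mapper lowers the spill count directly, which is the reasoning behind giving each mapper less input.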

Regards
Bejoy KS

On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <[EMAIL PROTECTED]> wrote:

> i have a map reduce job that is generating a lot of intermediate key-value
> pairs. for example, when i am 1/3 complete with my map phase, i may have
> generated over 130,000,000 output records (which is about 9 gigabytes). to
> get to the 1/3 complete mark is very fast (less than 10 minutes), but at
> the 1/3 complete mark, it seems to stall. when i look at the counter logs,
> i do not see any logging of spilling yet. however, on the web job UI, i see
> that FILE_BYTES_WRITTEN and Spilled Records keep increasing. needless to
> say, i have to dig deeper to see what is going on.
>
> my question is, how do i fine tune my map reduce job with the above
> properties? namely, the property of generating a lot of intermediate
> key-value pairs? it seems the I/O operations are negatively impacting the
> job speed. there are so many map- and reduce-side tuning properties (see
> Tom White, Hadoop, 2nd edition, pp 181-182), i am a little unsure about
> just how to approach the tuning parameters. since the slow down is
> happening during the map-phase/task, i assume i should narrow down on the
> map-side tuning properties.
>
> by the way, i am using the CPU-intensive c1.medium instances of amazon web
> service's (AWS) elastic map reduce (EMR) on hadoop v0.20. a compute node
> has 2 mappers, 1 reducer, and 384 MB JVM memory per task. this instance
> type is documented to have moderate I/O performance.
>
> any help on fine tuning my particular map reduce job is appreciated.
>