Hadoop >> mail # user

how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
i have a map reduce job that is generating a lot of intermediate key-value
pairs. for example, when i am 1/3 complete with my map phase, i may have
generated over 130,000,000 output records (which is about 9 gigabytes). to
get to the 1/3 complete mark is very fast (less than 10 minutes), but at
the 1/3 complete mark, it seems to stall. when i look at the counter logs,
i do not see any logging of spilling yet. however, on the web job UI, i see
that FILE_BYTES_WRITTEN and Spilled Records keep increasing. needless to
say, i have to dig deeper to see what is going on.
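The figures above allow a rough estimate of the average intermediate record size and of the total map output if the rest of the map phase behaves the same way. The sketch below is only back-of-the-envelope arithmetic. It assumes that "9 gigabytes" means 9 * 10^9 bytes and that the remaining two thirds of the input produce output at the same rate, neither of which is stated in the post.

```python
# Figures quoted in the post.
RECORDS_AT_ONE_THIRD = 130_000_000
# Assumption: "9 gigabytes" is read as 9 * 10^9 bytes.
BYTES_AT_ONE_THIRD = 9 * 10**9


def average_record_bytes(total_bytes, records):
    """Average serialized size of one intermediate key-value pair."""
    return total_bytes / records


def projected_total(value_so_far, done_parts=1, total_parts=3):
    """Linear extrapolation from the completed share of the map phase.

    Assumption: the rest of the input emits output at the same rate.
    """
    return value_so_far * total_parts // done_parts


if __name__ == "__main__":
    avg = average_record_bytes(BYTES_AT_ONE_THIRD, RECORDS_AT_ONE_THIRD)
    print("average record size (bytes): %.1f" % avg)
    print("projected map output records:", projected_total(RECORDS_AT_ONE_THIRD))
    print("projected map output bytes:", projected_total(BYTES_AT_ONE_THIRD))
```

Under these assumptions the records are small (roughly 70 bytes each) and the full map phase would emit on the order of 390 million records, or about 27 gigabytes.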

my question is, how do i fine-tune a map reduce job with the above
property, namely, that it generates a lot of intermediate key-value
pairs? it seems the I/O operations are negatively impacting the
job speed. there are so many map- and reduce-side tuning properties (see
Tom White, Hadoop, 2nd edition, pp 181-182) that i am a little unsure
how to approach the tuning parameters. since the slowdown is
happening during the map phase, i assume i should focus on the
map-side tuning properties.
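One way to reason about the map-side properties is to estimate which part of the sort buffer fills first. The sketch below is a rough model, not a description of this particular job. It assumes the Hadoop 0.20 defaults (io.sort.mb = 100, io.sort.record.percent = 0.05, io.sort.spill.percent = 0.80), assumes each record costs 16 bytes of accounting space in the buffer, and uses an illustrative average record size of 69 bytes derived from the figures quoted earlier (9 * 10^9 bytes over 130,000,000 records). The function names are hypothetical helpers, not Hadoop APIs.

```python
# Assumption: Hadoop 0.20 keeps about 16 bytes of bookkeeping per record
# in the share of io.sort.mb reserved by io.sort.record.percent.
ACCOUNTING_BYTES_PER_RECORD = 16


def spill_thresholds(io_sort_mb, record_percent, spill_percent, avg_record_bytes):
    """Return (records until the accounting area triggers a spill,
    records until the data area triggers a spill)."""
    buffer_bytes = io_sort_mb * 1024 * 1024
    accounting_bytes = buffer_bytes * record_percent
    data_bytes = buffer_bytes - accounting_bytes
    by_accounting = int(accounting_bytes * spill_percent / ACCOUNTING_BYTES_PER_RECORD)
    by_data = int(data_bytes * spill_percent / avg_record_bytes)
    return by_accounting, by_data


def balanced_record_percent(avg_record_bytes):
    """Share of the buffer at which both areas fill at about the same time."""
    return ACCOUNTING_BYTES_PER_RECORD / (ACCOUNTING_BYTES_PER_RECORD + avg_record_bytes)


if __name__ == "__main__":
    avg = 69  # illustrative average record size in bytes (assumption)
    by_accounting, by_data = spill_thresholds(100, 0.05, 0.80, avg)
    print("records before an accounting-driven spill:", by_accounting)
    print("records before a data-driven spill:", by_data)
    print("candidate io.sort.record.percent: %.2f" % balanced_record_percent(avg))
```

If the model holds, small records exhaust the accounting area long before the data area, so spills happen more often than the raw buffer size suggests. Raising io.sort.record.percent toward the balanced value, and io.sort.mb as far as the task heap allows, are candidates to test rather than guaranteed fixes.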

by the way, i am using the CPU-intensive c1.medium instances of amazon web
services' (AWS) elastic map reduce (EMR) on hadoop v0.20. a compute node
has 2 mappers, 1 reducer, and 384 MB JVM memory per task. this instance
type is documented to have moderate I/O performance.

any help on fine-tuning my particular map reduce job is appreciated.