how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)


Jane Wayne 2012-04-03, 08:38
Bejoy Ks 2012-04-03, 11:48
Re: how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
I don't have the option of setting the map heap size to 2 GB, since my
real environment is AWS EMR and those constraints are fixed.

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html is the
link where I am currently reading about the meaning of io.sort.factor
and io.sort.mb.

It seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
shuffle/reduce tasks. Am I correct to say, then, that io.sort.factor is
not relevant here (yet, anyway), since I don't really make it to the
reduce phase (except for a very small data size)?
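
In case it helps to see where these live, here is a minimal sketch of how
I believe the two properties get set on a job with the 0.20-style API (the
class name and the values are just placeholders, not settings I have
verified):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // sketch: setting the map-side sort buffer and the merge factor before
    // submitting a job (values are placeholders, not recommendations)
    public class SubmitSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitSketch.class);
            conf.setInt("io.sort.mb", 200);       // in-memory sort buffer size, in MB
            conf.setInt("io.sort.factor", 100);   // number of spill segments merged at once
            // mapper, reducer, input and output paths would be set here as usual
            JobClient.runJob(conf);
        }
    }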

In that link above, the description of io.sort.mb reads: "The
cumulative size of the serialization and accounting buffers storing
records emitted from the map, in megabytes." There is a paragraph above
the table saying that this value is essentially the threshold that
triggers a sort and spill to disk. Furthermore, it says, "If either
buffer fills completely while the spill is in progress, the map thread
will block," which is what I believe is happening in my case.
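
To make sure I am reading that right, here is the spill-threshold
arithmetic as I understand it from that page, using the documented
defaults (io.sort.mb = 100, io.sort.spill.percent = 0.80,
io.sort.record.percent = 0.05, and 16 bytes of accounting data per
record); this is only my own back-of-the-envelope sketch:

    // spill-threshold arithmetic as I understand the 0.20 documentation;
    // the defaults and the 16-bytes-per-record figure come from that page
    public class SpillMath {
        public static void main(String[] args) {
            int ioSortMb = 100;          // io.sort.mb (default)
            double spillPct = 0.80;      // io.sort.spill.percent (default)
            double recordPct = 0.05;     // io.sort.record.percent (default)

            long buffer = ioSortMb * 1024L * 1024L;
            long serBytes = (long) (buffer * (1.0 - recordPct));  // ~95 MB for serialized key/value bytes
            long acctBytes = (long) (buffer * recordPct);         // ~5 MB for per-record accounting info

            long bytesBeforeSpill = (long) (serBytes * spillPct);           // ~76 MB of map output bytes
            long recordsBeforeSpill = (long) ((acctBytes / 16) * spillPct); // ~262,144 records

            System.out.println("a spill starts at ~" + bytesBeforeSpill
                    + " bytes or ~" + recordsBeforeSpill + " records, whichever comes first");
        }
    }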

This sentence concerns me: "Minimizing the number of spills to disk
can decrease map time, but a larger buffer also decreases the memory
available to the mapper." To minimize the number of spills, you need a
larger buffer; however, the statement seems to suggest NOT minimizing
the number of spills: a) you will not decrease map time, and b) you
will not decrease the memory available to the mapper. So, in your
advice below, you say to increase io.sort.mb, but I may actually want
to decrease it (if I understood the documentation correctly?).
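
Either way, I am assuming I can experiment in both directions without
touching the job code, as long as the driver goes through ToolRunner so
that -D overrides (e.g. -D io.sort.mb=...) are picked up from the
command line. A rough sketch of that kind of driver (the class name is a
placeholder, and the mapper/reducer/path setup is omitted):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // sketch of a driver that implements Tool so io.sort.mb and friends can
    // be changed per run with -D on the command line instead of editing code
    public class MyDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // getConf() already carries any -D io.sort.mb=... overrides
            JobConf job = new JobConf(getConf(), MyDriver.class);
            // mapper, reducer, input and output paths would be set here as usual
            JobClient.runJob(job);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyDriver(), args));
        }
    }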

It seems these three map tuning parameters, io.sort.mb,
io.sort.record.percent, and io.sort.spill.percent, involve a painful
trade-off between speed and memory. To me, if you set them high, more
serialized data + metadata is stored in memory before a spill (an I/O
operation) is performed. You also get fewer merges (fewer I/O
operations?), but the downsides are blocked map operations and higher
memory requirements. If you set them low, there are more frequent
spills (more I/O operations), but lower memory requirements. It just
seems like no matter what you do, you are stuck: you may stall the
mapper if the values are high, because of the time required to spill
an enormous amount of data; you may stall the mapper if the values are
low, because of the number of I/O operations required (spill/merge).
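
To put my own numbers into that trade-off (from my original post below:
roughly 130,000,000 records at about 9 GB, so something like 70 bytes per
record on average), here is a rough sketch of which buffer I think fills
first; again, the 5% and 16-bytes-per-record figures are just the
documented defaults, not something I have measured:

    // rough estimate of which map-side buffer fills first for small records;
    // ~70 bytes/record is approximated from my own job (~130M records ~= 9 GB)
    public class BufferTradeoff {
        public static void main(String[] args) {
            long avgRecordBytes = 70;    // ~9 GB / 130M records, approximate
            int ioSortMb = 100;          // io.sort.mb (default)
            double recordPct = 0.05;     // io.sort.record.percent (default)

            long buffer = ioSortMb * 1024L * 1024L;
            long acctCapacity = (long) (buffer * recordPct) / 16;                    // ~327,680 records of metadata
            long serCapacity = (long) (buffer * (1.0 - recordPct)) / avgRecordBytes; // ~1,400,000 records of data

            // whichever capacity is smaller decides when spilling starts, so for
            // records this small the accounting buffer appears to be the limit
            System.out.println("accounting buffer holds ~" + acctCapacity + " records");
            System.out.println("serialization buffer holds ~" + serCapacity + " records");
        }
    }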

I must be misunderstanding something here, because everywhere I read,
Hadoop is supposed to be #1 at sorting. But here, in dealing with the
intermediate key-value pairs, in the process of sorting, mappers can
stall for any number of reasons.

Does anyone know of a competitive dynamic Hadoop clustering service
like AWS EMR? The reason I ask is that AWS EMR does not use HDFS (it
uses S3), and therefore data locality is not possible. Also, I have
read that the TCP protocol is not efficient for network transfers; if
the S3 nodes and the task nodes are far apart, that distance will
certainly exacerbate the slowness. It seems there are a lot of factors
working against me.

Any help is appreciated.

On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>
> Jane,
>       From my first look, properties that can help you could be
> - Increase io sort factor to 100
> - Increase io.sort.mb to 512Mb
> - increase map task heap size to 2GB.
>
> If the task still stalls, try providing lesser input for each mapper.
>
> Regards
> Bejoy KS
>
> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <[EMAIL PROTECTED]> wrote:
>
> > i have a map reduce job that is generating a lot of intermediate key-value
> > pairs. for example, when i am 1/3 complete with my map phase, i may have
> > generated over 130,000,000 output records (which is about 9 gigabytes). to
> > get to the 1/3 complete mark is very fast (less than 10 minutes), but at
> > the 1/3 complete mark, it seems to stall. when i look at the counter logs,
> > i do not see any logging of spilling yet. however, on the web job UI, i see
Serge Blazhievsky 2012-04-04, 17:17
Jane Wayne 2012-04-05, 03:15