Hadoop >> mail # user >> how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
i don't have the option of setting the map heap size to 2 GB since my
real environment is AWS EMR and the constraints are set.

i am currently reading about the meaning of io.sort.factor and
io.sort.mb at this link:
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html

it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
shuffle/reduce task. am i correct to say then that io.sort.factor is
not relevant here (yet, anyways)? i don't really make it to the
reduce phase (except with a very small data size).

in that link above, here is the description for io.sort.mb: "The
cumulative size of the serialization and accounting buffers storing
records emitted from the map, in megabytes." there's a paragraph above
the table that says this value is simply the threshold that triggers a
sort and spill to disk. furthermore, it says, "If either buffer fills
completely while the spill is in progress, the map thread will block,"
which is what i believe is happening in my case.
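
A rough sketch of how that buffer is divided may help here. It assumes the 0.20 defaults (io.sort.mb = 100, io.sort.record.percent = 0.05, io.sort.spill.percent = 0.80) and the 16 bytes of accounting data per record that the same tutorial describes. The average record size is derived from the figures quoted in this thread (about 130,000,000 records in about 9 gigabytes). The function names are made up for illustration; this is a simplified model, not Hadoop's actual code.

```python
# Simplified model of the map-side sort buffer in Hadoop 0.20.
# Not Hadoop's code; names and structure are illustrative assumptions.

ACCOUNTING_BYTES_PER_RECORD = 16  # per the 0.20 tutorial


def spill_trigger(io_sort_mb, record_percent, spill_percent, avg_record_bytes):
    """Return (records_before_spill, buffer_that_fills_first)."""
    total_bytes = io_sort_mb * 1024 * 1024
    accounting_bytes = total_bytes * record_percent
    serialization_bytes = total_bytes - accounting_bytes

    # Records each buffer can hold before crossing the spill threshold.
    by_accounting = accounting_bytes * spill_percent / ACCOUNTING_BYTES_PER_RECORD
    by_serialization = serialization_bytes * spill_percent / avg_record_bytes

    if by_accounting <= by_serialization:
        return int(by_accounting), "accounting"
    return int(by_serialization), "serialization"


def balanced_record_percent(avg_record_bytes):
    """record_percent at which both buffers fill at about the same time."""
    return ACCOUNTING_BYTES_PER_RECORD / (ACCOUNTING_BYTES_PER_RECORD + avg_record_bytes)


if __name__ == "__main__":
    # Average record size from the numbers in this thread (roughly 74 bytes).
    avg_record_bytes = 9 * 1024**3 / 130_000_000
    print(spill_trigger(100, 0.05, 0.80, avg_record_bytes))
    print(round(balanced_record_percent(avg_record_bytes), 3))
```

Under these assumptions, small records would fill the accounting buffer well before the serialization buffer, so io.sort.record.percent may matter as much as io.sort.mb for a job like this one.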

this sentence concerns me: "Minimizing the number of spills to disk
can decrease map time, but a larger buffer also decreases the memory
available to the mapper." to minimize the number of spills, you need a
larger buffer; however, this statement seems to suggest NOT minimizing
the number of spills: a) you will not decrease map time, but b) you
will not decrease the memory available to the mapper either. so, where
your advice below says to increase io.sort.mb, i may actually want to
decrease it. (did i understand the documentation correctly?)

it seems these three map tuning parameters, io.sort.mb,
io.sort.record.percent, and io.sort.spill.percent, are a painful
trade-off between speed and memory. as i understand it, if you set
them high, more serialized data + metadata are stored in memory before
a spill (an I/O operation) is performed. you also get fewer merges
(fewer I/O operations?), but the negatives are blocked map operations
and higher memory requirements. if you set them low, there are more
frequent spills (more I/O operations), but lower memory requirements.
it just seems like no matter what you do, you are stuck: you may stall
the mapper if the values are high because of the amount of time
required to spill an enormous amount of data; you may stall the mapper
if the values are low because of the number of I/O operations required.
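
The trade-off described above can be put into rough numbers. The sketch below assumes a hypothetical map task that emits 10,000,000 records, and a simplified merge in which each round combines up to io.sort.factor spill files (as far as the 0.20 tutorial describes it, that parameter also limits how many on-disk segments are merged at once on the map side). Hadoop's real merge logic is more involved, and the function names are illustrative only.

```python
import math


def estimate_spills(map_output_records, records_per_spill):
    """Number of spill files a single map task would write."""
    return math.ceil(map_output_records / records_per_spill)


def estimate_merge_passes(spill_files, io_sort_factor):
    """Merge rounds needed if each round merges up to io_sort_factor files."""
    if io_sort_factor < 2:
        raise ValueError("io_sort_factor must be at least 2")
    passes = 0
    while spill_files > 1:
        spill_files = math.ceil(spill_files / io_sort_factor)
        passes += 1
    return passes


if __name__ == "__main__":
    records = 10_000_000  # hypothetical output of one map task
    for per_spill in (250_000, 1_000_000):
        spills = estimate_spills(records, per_spill)
        for factor in (10, 100):
            passes = estimate_merge_passes(spills, factor)
            print(per_spill, spills, factor, passes)
```

In this model a larger buffer reduces the number of spill files, and a larger io.sort.factor reduces how many times those files are re-read and re-written during the merge, so the two settings are not strictly either/or.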

i must be understanding something wrong here because everywhere i
read, hadoop is supposed to be #1 at sorting. but here, in dealing
with the intermediary key-value pairs, in the process of sorting,
mappers can stall for any number of reasons.

does anyone know of any competing dynamic hadoop clustering service
like AWS EMR? the reason i ask is that AWS EMR does not use
HDFS (it uses S3), and therefore, data locality is not possible. also,
i have read that the TCP protocol is not efficient for network
transfers; if the S3 node and task nodes are far apart, this distance
will certainly exacerbate the slow speed. it seems there are a lot of
factors working against me.

any help is appreciated.

On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
> Jane,
>       From my first look, properties that can help you could be
> - Increase io sort factor to 100
> - Increase io.sort.mb to 512Mb
> - increase map task heap size to 2GB.
> If the task still stalls, try providing lesser input for each mapper.
> Regards
> Bejoy KS
> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <[EMAIL PROTECTED]> wrote:
> > i have a map reduce job that is generating a lot of intermediate key-value
> > pairs. for example, when i am 1/3 complete with my map phase, i may have
> > generated over 130,000,000 output records (which is about 9 gigabytes). to
> > get to the 1/3 complete mark is very fast (less than 10 minutes), but at
> > the 1/3 complete mark, it seems to stall. when i look at the counter logs,
> > i do not see any logging of spilling yet. however, on the web job UI, i see