how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)


Jane Wayne 2012-04-03, 08:38
Bejoy Ks 2012-04-03, 11:48
Jane Wayne 2012-04-04, 03:11
Serge Blazhievsky 2012-04-04, 17:17

Re: how to fine-tune my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
serge, i specify 15 instances, but only 14 end up being data/task
nodes. 1 instance is reserved as the name node (job tracker).

On Wed, Apr 4, 2012 at 1:17 PM, Serge Blazhievsky
<[EMAIL PROTECTED]> wrote:
> How many datanodes do you use for your job?
>
> On 4/3/12 8:11 PM, "Jane Wayne" <[EMAIL PROTECTED]> wrote:
>
>>i don't have the option of setting the map heap size to 2 GB since my
>>real environment is AWS EMR and the constraints are set.
>>
>>http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html this
>>link is where i am currently reading on the meaning of io.sort.factor
>>and io.sort.mb.
>>
>>it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
>>shuffle/reduce task. am i correct to say then that io.sort.factor is
>>not relevant here (yet, anyway), since i don't really make it to the
>>reduce phase (except for only a very small data size)?
>>
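For reference, a minimal sketch of how these two properties can be set per job with the 0.20-era JobConf API; the class, job name, and paths below are placeholders, not anything from this thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SortTuningSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortTuningSketch.class);
        conf.setJobName("sort-tuning-sketch");

        // io.sort.mb: size, in MB, of the in-memory buffer holding serialized
        // map output plus its accounting metadata before it spills to disk.
        conf.setInt("io.sort.mb", 256);

        // io.sort.factor: how many spill files/streams get merged at once.
        // It is used on the map side too, whenever a task has produced more
        // spills than can be merged in a single pass, not only in the reduce.
        conf.setInt("io.sort.factor", 100);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }
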
>>in that link above, here is the description for io.sort.mb: "The
>>cumulative size of the serialization and accounting buffers storing
>>records emitted from the map, in megabytes." there's a paragraph above
>>the table saying this value is simply the threshold that triggers a
>>sort and spill to disk. furthermore, it says, "If either buffer fills
>>completely while the spill is in progress, the map thread will block,"
>>which is what i believe is happening in my case.
>>
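As a rough worked example with the stock 0.20 defaults (io.sort.mb = 100, io.sort.spill.percent = 0.80, io.sort.record.percent = 0.05): the 100 MB buffer splits into roughly 95 MB for serialized records and 5 MB of accounting space at 16 bytes per record, so a background spill kicks off at about 76 MB of serialized output or about 262,144 records, whichever limit is hit first. A map task that emits a huge number of small key-value pairs typically hits the record limit long before the byte limit, which is why io.sort.record.percent is one of the knobs listed in that same table.
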
>>this sentence concerns me: "Minimizing the number of spills to disk
>>can decrease map time, but a larger buffer also decreases the memory
>>available to the mapper." to minimize the number of spills, you need a
>>larger buffer; however, this statement seems to suggest NOT minimizing
>>the number of spills: a) you will not decrease map time, b) you will
>>not decrease the memory available to the mapper. so, in your advice
>>below, you say to increase, but i may actually want to decrease the
>>value for io.sort.mb (if i understood the documentation correctly?).
>>
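One way to read that sentence: it is not advice against minimizing spills, it is a warning about where the memory comes from. With a fixed child heap of, say, 1 GB, raising io.sort.mb from 100 to 512 would cut a task that emits 1.5 GB of map output from very roughly 20 spills down to roughly 4 (ignoring the record-count limit), but it also leaves only about half of the heap for the map function's own objects.
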
>>it seems these three map tuning parameters, io.sort.mb,
>>io.sort.record.percent, and io.sort.spill.percent, are a painful
>>trade-off between speed and memory. to me, if you set them high, more
>>serialized data + metadata are stored in memory before a spill (an I/O
>>operation) occurs. you also get fewer merges (fewer I/O operations?),
>>but the negatives are blocking map operations and higher memory
>>requirements. if you set them low, there are more frequent spills
>>(more I/O operations), but lower memory requirements. it just seems
>>like no matter what you do, you are stuck: you may stall the mapper if
>>the values are high because of the amount of time required to spill an
>>enormous amount of data; you may stall the mapper if the values are
>>low because of the number of I/O operations required (spill/merge).
>>
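One way to see which side of that trade-off a given job lands on is to compare its spill counter against its map output counter after a run: if spilled records are much larger than map output records, the map side is writing and re-merging the same data several times. A rough sketch against the 0.20-era counters; the counter group name below is the usual one for that release and is worth double-checking on the EMR build:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SpillReport {
      // Runs the job, then prints how many times each map output record was
      // written to disk on average; a ratio well above 1.0 means records are
      // being spilled and re-merged more than once.
      public static void runAndReport(JobConf conf) throws Exception {
        RunningJob job = JobClient.runJob(conf);
        Counters counters = job.getCounters();
        String group = "org.apache.hadoop.mapred.Task$Counter";
        long spilled = counters.findCounter(group, "SPILLED_RECORDS").getCounter();
        long mapOut = counters.findCounter(group, "MAP_OUTPUT_RECORDS").getCounter();
        System.out.println("spilled records    : " + spilled);
        System.out.println("map output records : " + mapOut);
        if (mapOut > 0) {
          System.out.println("spill ratio        : " + (double) spilled / mapOut);
        }
      }
    }
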
>>i must be misunderstanding something here, because everywhere i read,
>>hadoop is supposed to be #1 at sorting. but here, when sorting the
>>intermediate key-value pairs, mappers can stall for any number of
>>reasons.
>>
>>does anyone know of a competitive dynamic hadoop clustering service
>>like AWS EMR? the reason i ask is that AWS EMR does not use HDFS (it
>>uses S3), and therefore, data locality is not possible. also, i have
>>read that the TCP protocol is not efficient for network transfers; if
>>the S3 nodes and the task nodes are far apart, that distance will
>>certainly exacerbate the slow speeds. it seems there are a lot of
>>factors working against me.
>>
>>any help is appreciated.
>>
>>On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>>>
>>> Jane,
>>>       From my first look, the properties that could help you are:
>>> - Increase io.sort.factor to 100
>>> - Increase io.sort.mb to 512 MB
>>> - Increase the map task heap size to 2 GB
>>>
>>> If the task still stalls, try providing less input to each mapper.
>>>
>>> Regards
>>> Bejoy KS
>>>
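For what it is worth, values like the ones suggested above can also be passed per run rather than baked into the cluster config, as long as the job driver goes through ToolRunner so that generic -D options are picked up. A minimal sketch of such a driver (class and paths are placeholders):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class TunableDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // getConf() already carries any -D key=value pairs from the command
        // line, so io.sort.* and the child JVM options arrive here as-is.
        JobConf conf = new JobConf(getConf(), TunableDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TunableDriver(), args));
      }
    }

It could then be invoked along the lines of: hadoop jar tunable.jar TunableDriver -D io.sort.factor=100 -D io.sort.mb=512 -D mapred.child.java.opts=-Xmx2048m input/ output/ (the jar name and paths are again placeholders).
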
>>> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <[EMAIL PROTECTED]>