Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
Jane Wayne 2012-04-05, 03:15
serge, i specify 15 instances, but only 14 end up being data/task
nodes. 1 instance is reserved as the name node (job tracker).

On Wed, Apr 4, 2012 at 1:17 PM, Serge Blazhievsky <[EMAIL PROTECTED]> wrote:
> How many datanodes do you use for your job?
>
> On 4/3/12 8:11 PM, "Jane Wayne" <[EMAIL PROTECTED]> wrote:
>
>>i don't have the option of setting the map heap size to 2 GB since my
>>real environment is AWS EMR and the constraints are set.
>>
>>http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html this
>>link is where i am currently reading on the meaning of io.sort.factor
>>and io.sort.mb.
>>
>>it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
>>shuffle/reduce task. am i correct to say then that io.sort.factor is
>>not relevant here (yet, anyway)? since i don't really make it to the
>>reduce phase (except for only a very small data size).
>>
>>in that link above, here is the description for io.sort.mb: "The
>>cumulative size of the serialization and accounting buffers storing
>>records emitted from the map, in megabytes." there's a paragraph above
>>the table that says this value is simply the threshold that triggers a
>>sort and spill to the disk. furthermore, it says, "If either buffer fills
>>completely while the spill is in progress, the map thread will block,"
>>which is what i believe is happening in my case.
>>
>>this sentence concerns me, "Minimizing the number of spills to disk
>>can decrease map time, but a larger buffer also decreases the memory
>>available to the mapper." to minimize the number of spills, you need a
>>larger buffer; however, this statement seems to suggest to NOT
>>minimize the number of spills; a) you will not decrease map time, b)
>>you will not decrease the memory available to the mapper. so, in your
>>advice below, you say to increase, but i may actually want to decrease
>>the value for io.sort.mb. (if i understood the documentation
>>correctly, ????)
>>
>>it seems these three map tuning parameters, io.sort.mb,
>>io.sort.record.percent, and io.sort.spill.percent are a pain-point
>>trading off between speed and memory. to me, if you set them high,
>>more serialized data + metadata are stored in memory before a spill
>>(an I/O operation is performed). you also get fewer merges (fewer I/O
>>operations?), but the negatives are blocking map operations and more
>>memory requirements. if you set them low, there are more frequent
>>spills (more I/O operations), but lower memory requirements. it just
>>seems like no matter what you do, you are stuck: you may stall the
>>mapper if the values are high because of the amount of time required
>>to spill an enormous amount of data; you may stall the mapper if the
>>values are low because of the number of I/O operations required
>>(spill/merge).
>>
>>i must be understanding something wrong here because everywhere i
>>read, hadoop is supposed to be #1 at sorting. but here, in dealing
>>with the intermediary key-value pairs, in the process of sorting,
>>mappers can stall for any number of reasons.
>>
>>does anyone know any competitive dynamic hadoop clustering service
>>like AWS EMR? the reason why i ask is because AWS EMR does not use
>>HDFS (it uses S3), and therefore, data locality is not possible. also,
>>i have read the TCP protocol is not efficient for network transfers;
>>if the S3 node and task nodes are far, this distance will certainly
>>exacerbate the situation of slow speed. it seems there are a lot of
>>factors working against me.
>>
>>any help is appreciated.
>>
>>On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>>>
>>> Jane,
>>> From my first look, properties that can help you could be
>>> - Increase io.sort.factor to 100
>>> - Increase io.sort.mb to 512 MB
>>> - Increase map task heap size to 2 GB.
>>>
>>> If the task still stalls, try providing less input for each mapper.
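A rough back-of-envelope sketch of the spill/merge trade-off discussed above. It assumes the 0.20-era defaults (io.sort.mb=100, io.sort.record.percent=0.05, io.sort.spill.percent=0.80, io.sort.factor=10) and 16 bytes of accounting data per record, as described in the linked tutorial. The function names, the record count, and the 100-byte average record size are made up for illustration, and the merge estimate is an idealized k-way merge, not Hadoop's actual merge scheduling.

```python
import math

# Accounting overhead per emitted record, per the r0.20.2 tutorial.
ACCOUNTING_BYTES_PER_RECORD = 16


def estimate_spills(num_records, avg_record_bytes, io_sort_mb=100,
                    record_percent=0.05, spill_percent=0.80):
    """Estimate how many spill files one map task writes (toy model)."""
    buffer_bytes = io_sort_mb * 1024 * 1024
    # io.sort.record.percent splits the buffer into accounting vs. data space.
    accounting_bytes = buffer_bytes * record_percent
    serialization_bytes = buffer_bytes - accounting_bytes
    # A spill starts when EITHER region reaches io.sort.spill.percent.
    by_accounting = (accounting_bytes * spill_percent) // ACCOUNTING_BYTES_PER_RECORD
    by_data = (serialization_bytes * spill_percent) // avg_record_bytes
    records_per_spill = max(1, int(min(by_accounting, by_data)))
    return math.ceil(num_records / records_per_spill)


def estimate_merge_passes(num_spills, io_sort_factor=10):
    """Idealized number of merge rounds when merging io.sort.factor files at a time."""
    passes = 0
    segments = num_spills
    while segments > 1:
        segments = math.ceil(segments / io_sort_factor)
        passes += 1
    return passes


if __name__ == "__main__":
    # Hypothetical workload: 10 million map output records of ~100 bytes each.
    for mb, factor in [(100, 10), (512, 100)]:
        spills = estimate_spills(10_000_000, 100, io_sort_mb=mb)
        passes = estimate_merge_passes(spills, io_sort_factor=factor)
        print(f"io.sort.mb={mb} io.sort.factor={factor}: "
              f"{spills} spills, {passes} merge pass(es)")
```

In this toy model the larger buffer and merge factor reduce both the estimated spill count and the merge passes, at the cost of more heap per map task, which is the constraint on EMR mentioned earlier. With small records the accounting region is what fills first, so io.sort.record.percent may matter as much as io.sort.mb.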
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <[EMAIL PROTECTED]>