Hadoop >> mail # dev >> Questions about recommendation value of the "io.sort.mb" parameter


Re: Questions about recommendation value of the "io.sort.mb" parameter
Hi 李钰

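The size of the map output depends on your Mapper class. The Mapper
processes the input data and may emit any number of key-value pairs per
input record, so the output can be smaller or much larger than the
input split.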
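
As a rough illustration of map output exceeding its input, here is a plain-Python sketch (not Hadoop API code). The function names and the size estimate (key bytes plus an assumed 4 bytes per integer value) are made up for this example; the mapper emits one pair per word bigram, so most words are written twice.

```python
def bigram_mapper(line):
    """Hypothetical mapper: emit one (bigram, 1) pair per pair of adjacent words."""
    words = line.split()
    for i in range(len(words) - 1):
        yield (words[i] + " " + words[i + 1], 1)


def approx_output_bytes(pairs):
    # Crude size estimate (an assumption for illustration): UTF-8 bytes of the
    # key plus 4 bytes for the integer value. Real serialized sizes differ.
    return sum(len(key.encode("utf-8")) + 4 for key, _ in pairs)


line = "the quick brown fox jumps over the lazy dog"
pairs = list(bigram_mapper(line))

input_bytes = len(line.encode("utf-8"))
output_bytes = approx_output_bytes(pairs)

print("input bytes:", input_bytes)
print("approx. output bytes:", output_bytes)
```

Any mapper that emits several pairs per record (n-grams, joins that replicate records, inverted-index builders) behaves similarly, so a 64MB split can produce well over 64MB of map output.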

2010/6/23 李钰 <[EMAIL PROTECTED]>:
> Hi Sriguru,
>
> Thanks a lot for your comments and suggestions!
> I still have some questions. Since the map mainly does data preparation,
> that is, splitting the input data into KVPs, then sorting and partitioning
> before the spill, would the size of the map output KVPs be much larger than
> the input data size? If not, since one map task deals with one input split,
> and one input split is usually 64MB, the map output KVPs would be
> approximately 64MB. Could you please give me an example of map output that
> is much larger than the input split? This has confused me for some time,
> thanks.
>
> To everyone else,
>
> I would also really appreciate your help if you know about this, thanks.
>
> Best Regards,
> Carp
>
> On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi <[EMAIL PROTECTED]> wrote:
>
>> Hi Carp,
>>  Your assumption is right that this is a per-map-task setting.
>> However, this buffer stores map output KVPs, not input. Therefore the
>> optimal value depends on how much data your map task is generating.
>>
>> If your output per map is greater than io.sort.mb, these rules of thumb
>> could work for you:
>>
>> 1) Increase max heap of map tasks to use RAM better, but not hit swap.
>> 2) Set io.sort.mb to ~70% of heap.
>>
>> Overall, causing extra "spills" (because of insufficient io.sort.mb) is
>> much better than risking swapping (by setting io.sort.mb and heap too
>> large), in terms of relative performance penalty you will pay.
>>
>> Cheers,
>> Sriguru
>>
>> >-----Original Message-----
>> >From: 李钰 [mailto:[EMAIL PROTECTED]]
>> >Sent: Wednesday, June 23, 2010 12:27 PM
>> >To: [EMAIL PROTECTED]
>> >Subject: Questions about recommendation value of the "io.sort.mb"
>> >parameter
>> >
>> >Dear all,
>> >
>> >Here I've got a question about the "io.sort.mb" parameter. We can find
>> >material from Yahoo! or Cloudera which recommends setting this value to
>> >200 if the job scale is large, but I'm confused about this. As far as I
>> >know, the tasktracker will launch a child JVM for each task, and
>> >“*io.sort.mb*” represents the in-memory buffer size inside *one map task
>> >child JVM*. The default value of 100MB should be large enough, because
>> >the input split of one map task is usually 64MB, the same as the block
>> >size we usually set. Then why is the recommended “*io.sort.mb*” 200MB
>> >for large jobs (and it really works)? How could the job size affect the
>> >procedure? Is there any fault in my understanding? Any comment or
>> >suggestion will be highly valued. Thanks in advance.
>> >
>> >Best Regards,
>> >Carp
>>
>
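
The rules of thumb quoted above can be turned into a small back-of-the-envelope calculation. This is only a sketch under stated assumptions: the helper names are hypothetical, the 70% figure comes from Sriguru's mail, and the spill estimate assumes the buffer spills at the default io.sort.spill.percent of 0.80 while ignoring the space reserved for record metadata.

```python
import math


def suggest_io_sort_mb(heap_mb, fraction=0.7):
    """Rule of thumb from the thread: io.sort.mb at roughly 70% of the task heap."""
    return int(heap_mb * fraction)


def estimate_spills(map_output_mb, io_sort_mb, spill_percent=0.8):
    """Simplified estimate of spill files written by one map task.

    Assumes a spill starts once the buffer is spill_percent full and
    ignores record-metadata overhead, so real counts can be higher.
    """
    usable_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / usable_mb))


# A map task producing about 300MB of output, with the default and a larger buffer.
for buffer_mb in (100, 200):
    print(buffer_mb, "MB buffer ->", estimate_spills(300, buffer_mb), "spills")

print("suggested io.sort.mb for a 512MB heap:", suggest_io_sort_mb(512))
```

Under this model, a map task whose output stays around 64MB spills once with the default 100MB buffer, while a task with heavier output spills fewer times as the buffer grows, which is consistent with the 200MB recommendation helping large jobs.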

--
Best Regards

Jeff Zhang