Hadoop >> mail # dev >> Questions about recommendation value of the "io.sort.mb" parameter


Re: Questions about recommendation value of the "io.sort.mb" parameter
Hi Todd,

Thanks a lot for your detailed explanation and recommendation; it really
helps a lot!

Best Regards,
Carp

2010/6/26 Todd Lipcon <[EMAIL PROTECTED]>

> 2010/6/25 Yu Li <[EMAIL PROTECTED]>
>
> > Hi Todd,
> >
> > Sorry to bother you again, but could you explain further: what is the
> > 24-byte additional overhead for each record of map output? What causes
> > the overhead, and what is it for? Thanks a lot.
> >
>
> I actually misremembered, sorry - it's 16 bytes.
>
> In the kvindices buffer:
> 4 bytes for partition ID of each record
> 4 bytes for the key offset in data buffer
> 4 bytes for the value offset in data buffer
>
> In the kvoffsets buffer:
> 4 bytes for an index into the kvindices buffer (this is so that the spill
> sort can just move around indices instead of the entire object)
>
> For more detail, I would recommend reading the code, or looking for Chris
> Douglas's slides from the HUG earlier this year where he gave a very
> informative talk on the evolution of the mapside spill.
>
> -Todd
>
>
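Todd's accounting above can be sketched as a small back-of-envelope calculation. This is only illustrative; the class, constant, and method names below are mine, not Hadoop's internal identifiers:

```java
// Sketch of the per-record metadata accounting described above.
// Names are illustrative, not Hadoop's actual identifiers.
public class SpillAccounting {
    // kvindices: 3 ints per record (partition id, key offset, value offset)
    static final int KVINDICES_BYTES_PER_RECORD = 3 * Integer.BYTES; // 12
    // kvoffsets: 1 int per record, an index into kvindices, so the spill
    // sort can swap 4-byte indices instead of whole records
    static final int KVOFFSETS_BYTES_PER_RECORD = Integer.BYTES;     // 4

    /** Total bytes needed to buffer numRecords records of the given payload size. */
    static long bufferedBytes(long numRecords, long payloadBytes) {
        long overhead = numRecords
                * (KVINDICES_BYTES_PER_RECORD + KVOFFSETS_BYTES_PER_RECORD);
        return payloadBytes + overhead;
    }

    public static void main(String[] args) {
        // 64 MB of output made of tiny 24-byte records: the 16 bytes of
        // metadata per record add roughly two-thirds again on top of the
        // payload, so the buffer must be far larger than 64 MB.
        long records = (64L << 20) / 24;
        System.out.println(bufferedBytes(records, 64L << 20)); // ~107 MB
    }
}
```

This is why tiny records are the worst case for io.sort.mb: the metadata overhead is fixed per record, regardless of payload size.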
> >
> > Best Regards,
> > Carp
> > On June 24, 2010 at 1:49 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> >
> > > Plus there's some overhead for each record of map output. Specifically,
> > > 24 bytes. So if you output 64MB worth of data, but each of your objects
> > > is only 24 bytes long itself, you need more than 128MB worth of spill
> > > space for it. Last, the map output buffer begins spilling when it is
> > > partially full so that more records can be collected while the spill
> > > proceeds.
> > >
> > > 200MB io.sort.mb has enough headroom for most 64M input splits that
> > > don't blow up the data a lot. Expanding much above 200M doesn't buy you
> > > much for most jobs. The good news is that it's easy to tell by looking
> > > at the logs how many times the map tasks are spilling. If you're only
> > > spilling once, more io.sort.mb will not help.
> > >
> > > -Todd
> > >
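Todd's "count your spills" advice can be turned into a rough estimate. The sketch below assumes the buffer starts spilling at io.sort.spill.percent (0.80 by default) of its capacity; the class and method names are illustrative:

```java
// Back-of-envelope estimate of how many times a map task will spill
// for a given io.sort.mb. Assumes spilling starts at a configurable
// fraction of the buffer (io.sort.spill.percent, 0.80 by default).
// Names here are illustrative, not Hadoop API.
public class SpillEstimate {
    static int estimatedSpills(long mapOutputBytes, int ioSortMb, double spillPercent) {
        long spillTrigger = (long) (ioSortMb * (1L << 20) * spillPercent);
        // Each spill drains roughly one trigger's worth of buffered data.
        return (int) Math.max(1, Math.ceil((double) mapOutputBytes / spillTrigger));
    }

    public static void main(String[] args) {
        // A 64 MB split that roughly doubles in size: one spill with 200 MB.
        System.out.println(estimatedSpills(128L << 20, 200, 0.80)); // 1
        // The same output with a 100 MB buffer: two spills.
        System.out.println(estimatedSpills(128L << 20, 100, 0.80)); // 2
    }
}
```

If the estimate (or the task logs) shows a single spill already, raising io.sort.mb further buys nothing, exactly as Todd says.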
> > > 2010/6/23 李钰 <[EMAIL PROTECTED]>
> > >
> > > > Hi Jeff,
> > > >
> > > > Thanks for your quick reply. It seems my thinking was stuck on the
> > > > style of job I'm running. Now I'm much clearer about it.
> > > >
> > > > Best Regards,
> > > > Carp
> > > >
> > > > 2010/6/23 Jeff Zhang <[EMAIL PROTECTED]>
> > > >
> > > > > Hi 李钰
> > > > >
> > > > > The size of the map output depends on your Mapper class, which
> > > > > does the processing on the input data.
> > > > >
> > > > >
> > > > >
> > > > > 2010/6/23 李钰 <[EMAIL PROTECTED]>:
> > > > > > Hi Sriguru,
> > > > > >
> > > > > > Thanks a lot for your comments and suggestions!
> > > > > > Here I still have some questions: since the map phase mainly does
> > > > > > data preparation, i.e. splits the input data into KVPs, then sorts
> > > > > > and partitions them before the spill, would the size of the map
> > > > > > output KVPs be much larger than the input data size? If not, since
> > > > > > one map task deals with one input split, and one input split is
> > > > > > usually 64M, the map KVPs size would be approximately 64M. Could
> > > > > > you please give me an example of map output much larger than the
> > > > > > input split? It has really confused me for some time, thanks.
> > > > > >
> > > > > > Others,
> > > > > >
> > > > > > I'd also greatly appreciate your help if you know about this, thanks.
> > > > > >
> > > > > > Best Regards,
> > > > > > Carp
> > > > > >
> > > > > > On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi
> > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > >> Hi Carp,
> > > > > >> Your assumption is right that this is a per-map-task setting.
> > > > > >> However, this buffer stores map output KVPs, not input. Therefore
> > > > > >> the optimal value depends on how much data your map task is
> > > > > >> generating.
> > > > > >>
> > > > > >> If your output per map is greater than io.sort.mb, these rules of
> > > > > >> thumb could work for you:
> > > > > >>
> > > > > >> 1) Increase the max heap of map tasks to use RAM better, but not
> > > > > >> hit swap.
> > > > > >> 2) Set io.sort.mb to ~70% of heap.
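Sriguru's second rule of thumb amounts to a one-line calculation. A minimal sketch, assuming 0.70 is the fraction he suggests (the class and method names are illustrative):

```java
// Rough sizing helper for rule of thumb 2 above: io.sort.mb at ~70% of
// the map task heap. Purely illustrative; names are not Hadoop API.
public class SortMbSizing {
    static int suggestedIoSortMb(int mapHeapMb) {
        return (int) (mapHeapMb * 0.70);
    }

    public static void main(String[] args) {
        // e.g. a map task heap of -Xmx512m suggests io.sort.mb around 358
        System.out.println(suggestedIoSortMb(512)); // 358
    }
}
```

The remaining ~30% of heap leaves room for the Mapper's own objects, the sort machinery, and framework overhead, so the task does not hit the swap Sriguru warns about.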