Re: Questions about recommendation value of the "io.sort.mb" parameter
Hi Todd,

Thanks a lot for your detailed explanation and recommendation; it really
helps!

Best Regards,
Carp

2010/6/26 Todd Lipcon <[EMAIL PROTECTED]>

> 2010/6/25 Yu Li <[EMAIL PROTECTED]>
>
> > Hi Todd,
> >
> > Sorry to bother you again. Could you explain further what the 24 bytes
> > of additional overhead for each record of map output consist of? What
> > incurs the overhead, and what is it for? Thanks a lot.
> >
>
> I actually misremembered, sorry - it's 16 bytes.
>
> In the kvindices buffer:
> 4 bytes for partition ID of each record
> 4 bytes for the key offset in data buffer
> 4 bytes for the value offset in data buffer
>
> In the kvoffsets buffer:
> 4 bytes for an index into the kvindices buffer (this is so that the spill
> sort can just move around indices instead of the entire object)
>
> For more detail, I would recommend reading the code, or looking for Chris
> Douglas's slides from the HUG earlier this year where he gave a very
> informative talk on the evolution of the mapside spill.
>
> -Todd
>
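For anyone sizing the buffer, here is a rough sketch of the arithmetic implied by the breakdown above. It assumes exactly 16 accounting bytes per record (12 in kvindices plus 4 in kvoffsets) and a uniform average record size; the function and constant names are made up for illustration and are not Hadoop APIs.

```python
# Hypothetical helper, not part of Hadoop: estimates buffer space needed for
# map output when every record carries a fixed accounting overhead.
ACCOUNTING_BYTES_PER_RECORD = 16  # 12 bytes in kvindices + 4 bytes in kvoffsets

MB = 1024 * 1024


def spill_space_bytes(output_bytes: int, avg_record_bytes: int) -> int:
    """Return serialized output size plus per-record accounting overhead."""
    if avg_record_bytes <= 0:
        raise ValueError("avg_record_bytes must be positive")
    # Ceiling division: a partial trailing record still needs an index entry.
    records = -(-output_bytes // avg_record_bytes)
    return output_bytes + records * ACCOUNTING_BYTES_PER_RECORD


if __name__ == "__main__":
    # Tiny 16-byte records double the space needed for 64MB of output.
    print(spill_space_bytes(64 * MB, 16) // MB)
    # Larger 100-byte records add much less relative overhead.
    print(spill_space_bytes(64 * MB, 100) // MB)
```

The smaller the records, the larger the share of the buffer that goes to accounting rather than data, which is why jobs emitting many small key/value pairs can spill sooner than the raw output size suggests.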
>
> >
> > Best Regards,
> > Carp
> > On June 24, 2010 at 1:49 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> >
> > > Plus there's some overhead for each record of map output. Specifically,
> > > 24 bytes. So if you output 64MB worth of data, but each of your objects
> > > is only 24 bytes long itself, you need more than 128MB worth of spill
> > > space for it. Last, the map output buffer begins spilling when it is
> > > partially full so that more records can be collected while spill
> > > proceeds.
> > >
> > > 200MB io.sort.mb has enough headroom for most 64M input splits that
> > > don't blow up the data a lot. Expanding much above 200M for most jobs
> > > doesn't buy you much. Good news is it's easy to tell by looking at the
> > > logs to see how many times the map tasks are spilling. If you're only
> > > spilling once, more io.sort.mb will not help.
> > >
> > > -Todd
> > >
> > > 2010/6/23 李钰 <[EMAIL PROTECTED]>
> > >
> > > > Hi Jeff,
> > > >
> > > > > Thanks for your quick reply. Seems my thinking is stuck on the job
> > > > > style I'm running. Now I'm much clearer about it.
> > > >
> > > > Best Regards,
> > > > Carp
> > > >
> > > > 2010/6/23 Jeff Zhang <[EMAIL PROTECTED]>
> > > >
> > > > > Hi 李钰
> > > > >
> > > > > > The size of map output depends on your Mapper class. The Mapper
> > > > > > class will do processing on the input data.
> > > > >
> > > > >
> > > > >
> > > > > 2010/6/23 李钰 <[EMAIL PROTECTED]>:
> > > > > > > Hi Sriguru,
> > > > > >
> > > > > > Thanks a lot for your comments and suggestions!
> > > > > > > Here I still have some questions: since map mainly does data
> > > > > > > preparation, say splitting input data into KVPs, then sorting and
> > > > > > > partitioning before the spill, would the size of the map output
> > > > > > > KVPs be much larger than the input data size? If not, since one
> > > > > > > map task deals with one input split, and one input split is
> > > > > > > usually 64M, the map KVP size would be approximately 64M. Could
> > > > > > > you please give me an example of map output much larger than the
> > > > > > > input split? It has really confused me for some time, thanks.
> > > > > >
> > > > > > Others,
> > > > > >
> > > > > > > I would also really appreciate your help if you know about this, thanks.
> > > > > >
> > > > > > Best Regards,
> > > > > > Carp
> > > > > >
> > > > > > > On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > >> Hi Carp,
> > > > > > >>  Your assumption is right that this is a per-map-task setting.
> > > > > > >> However, this buffer stores map output KVPs, not input. Therefore
> > > > > > >> the optimal value depends on how much data your map task is
> > > > > > >> generating.
> > > > > >>
> > > > > > >> If your output per map is greater than io.sort.mb, these are rules
> > > > > > >> of thumb that could work for you:
> > > > > >>
> > > > > > >> 1) Increase max heap of map tasks to use RAM better, but without
> > > > > > >> hitting swap.
> > > > > >> 2) Set io.sort.mb to ~70% of heap.