Re: Proper blocksize and io.sort.mb setting when using compressed LZO files
Ted Yu 2010-09-27, 15:18
The setting should be fs.inmemory.size.mb
On Mon, Sep 27, 2010 at 7:15 AM, pig <[EMAIL PROTECTED]> wrote:
> Hi Sriguru,
> Thank you for the tips. Just to clarify a few things.
> Our machines have 32 GB of RAM.
> I'm planning on setting each machine to run 12 mappers and 2 reducers with
> the heap size set to 2048MB, so total memory usage for the heap would be 28GB.
> If this is the case, should io.sort.mb be set to 70% of 2048MB (so ~1400MB)?
> Also, I did not see a fs.inmemorysize.mb setting in any of the hadoop
> configuration files. Is that the correct setting I should be looking for?
> Should this also be set to 70% of the heap size, or does it need to share
> with the io.sort.mb setting?
> I assume if I'm bumping up io.sort.mb that much I also need to increase
> io.sort.factor from the default of 10. Is there a recommended relation
> between these two?
> Thank you for your help!
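
The arithmetic in the message above can be checked with a minimal Python sketch. It uses only the figures quoted in the message (32 GB of RAM, 12 map and 2 reduce slots, 2048MB heaps); the variable names are illustrative and nothing is read from a real cluster.

```python
# Figures taken from the message above; names are illustrative only.
RAM_MB = 32 * 1024        # physical memory per machine
MAP_SLOTS = 12
REDUCE_SLOTS = 2
HEAP_MB = 2048            # Java heap per task

total_heap_mb = (MAP_SLOTS + REDUCE_SLOTS) * HEAP_MB
heap_share_of_ram = total_heap_mb / RAM_MB

# ~70% of one task heap, the rule of thumb quoted for io.sort.mb
sort_buffer_mb = int(0.70 * HEAP_MB)

print(f"total task heap: {total_heap_mb} MB ({heap_share_of_ram:.1%} of RAM)")
print(f"70% of one task heap: {sort_buffer_mb} MB")
```

On these numbers the task heaps alone come to 87.5% of RAM before the TaskTracker, DataNode and OS are counted, which is above the ~80% target mentioned in the quoted advice, so a smaller heap or fewer slots may be worth considering.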
> On Sun, Sep 26, 2010 at 3:05 AM, Srigurunath Chakravarthi <
> [EMAIL PROTECTED]> wrote:
> > Ed,
> > Tuning io.sort.mb will certainly be worthwhile if you have enough RAM to
> > allow for a higher Java heap per map task without risking swapping.
> > Similarly, you can decrease spills on the reduce side using
> > fs.inmemorysize.mb.
> > You can use the following rules of thumb for tuning those two:
> > - Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80%
> > of physical memory across all processes (maps, reducers, TT, DN, other)
> > - Set it small enough to avoid swap activity, but
> > - Set it large enough to minimize disk spills.
> > - Ensure that io.sort.factor is set large enough to allow full use of
> > buffer space.
> > - Balance space for output records (default 95%) & record meta-data (5%).
> > Use io.sort.spill.percent and io.sort.record.percent
> > Your mileage may vary. We've seen job exec time improvements worth 1-3%
> > via spill-avoidance for miscellaneous applications.
> > Your other option of running a map per 32MB or 64MB of input should give
> > you better performance if your map task execution time is significant
> > (i.e., much larger than a few seconds) compared to the overhead of
> > launching map tasks and reading input.
> > Regards,
> > Sriguru
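
The balance between output records and record meta-data mentioned in the advice above can be illustrated with a rough Python sketch. It assumes the Hadoop 0.20-era map output buffer, where io.sort.record.percent of io.sort.mb is reserved for per-record accounting at 16 bytes per record; that 16-byte figure and the function names are assumptions for illustration, so check them against your Hadoop version.

```python
def sort_buffer_layout(io_sort_mb, record_percent=0.05, bytes_per_record_meta=16):
    """Split the sort buffer into data bytes and a record-count capacity.

    bytes_per_record_meta=16 is an assumption about Hadoop 0.20-era
    map output buffers, not a value taken from this thread.
    """
    total_bytes = io_sort_mb * 1024 * 1024
    meta_bytes = round(total_bytes * record_percent)
    data_bytes = total_bytes - meta_bytes
    max_records = meta_bytes // bytes_per_record_meta
    return data_bytes, max_records


def break_even_record_size(io_sort_mb, record_percent=0.05):
    """Average record size (bytes) at which data and meta-data fill together."""
    data_bytes, max_records = sort_buffer_layout(io_sort_mb, record_percent)
    return data_bytes / max_records


data_bytes, max_records = sort_buffer_layout(100)
print(data_bytes, max_records, break_even_record_size(100))
```

Under these assumptions, records averaging less than about 304 bytes exhaust the meta-data space before the data space, which is one situation where raising io.sort.record.percent might reduce spills.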
> > >-----Original Message-----
> > >From: pig [mailto:[EMAIL PROTECTED]]
> > >Sent: Saturday, September 25, 2010 2:36 AM
> > >To: [EMAIL PROTECTED]
> > >Subject: Proper blocksize and io.sort.mb setting when using compressed LZO files
> > >
> > >Hello,
> > >
> > >We just recently switched to using lzo compressed file input for our
> > >hadoop cluster using Kevin Weil's lzo library. The files are pretty
> > >uniform in size at around 200MB compressed. Our block size is 256MB.
> > >Decompressed, the average LZO input file is around 1.0GB. I noticed lots
> > >of our jobs are now spilling lots of data to disk. We have almost 3x more
> > >spilled records than map input records for example. I'm guessing this is
> > >because each mapper is getting a 200 MB lzo file which decompresses into
> > >1GB of data per mapper.
> > >
> > >Would you recommend solving this by reducing the block size to 64MB, or
> > >even 32MB and then using the LZO indexer so that a single 200MB lzo file
> > >is actually split among 3 or 4 mappers? Would it be better to play with
> > >the io.sort.mb value? Or, would it be best to play with both? Right now
> > >the io.sort.mb value is the default 200MB. Have other lzo users had to
> > >adjust their block size to compensate for the "expansion" of the data
> > >after decompression?
> > >
> > >Thank you for any help!
> > >
> > >~Ed
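
To put rough numbers on the block-size question in the original message, here is a small Python sketch that estimates how many mappers a 200MB indexed LZO file would be split across, and how much decompressed data each would see. It assumes one split per block and an even spread of the 1GB of decompressed data; real split counts depend on the input format and on where the LZO index allows splits, and the function name is made up for illustration.

```python
import math


def per_mapper_estimate(compressed_mb, decompressed_mb, split_mb):
    """Estimate (number of splits, decompressed MB per mapper).

    Assumes one split per block and evenly distributed data.
    """
    n_splits = math.ceil(compressed_mb / split_mb)
    return n_splits, decompressed_mb / n_splits


for split_mb in (256, 64, 32):
    n_splits, per_map_mb = per_mapper_estimate(200, 1024, split_mb)
    print(f"block {split_mb} MB: {n_splits} mapper(s), "
          f"~{per_map_mb:.0f} MB decompressed each")
```

With these figures a 64MB block gives about 4 mappers at roughly 256MB of decompressed input each, compared with a single mapper handling the full 1GB at the current 256MB block size.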