MapReduce >> mail # user >> Spill file compression

Re: Spill file compression
OK, just wanted to confirm. Maybe there is another problem then. I just
looked at the task logs and there were ~200 spills recorded for a single
task; only afterwards was there a merge phase. In my case, 200 spills amount
to about 2 GB (uncompressed). One map output record easily fits into the
in-memory buffer; in fact, a few records fit into it. But Hadoop decides to
write gigabytes of spill to disk, and it seems that the disk I/O and merging
make everything really slow. There doesn't seem to be a
max.num.spills.for.combine though. Is there any typical advice for this
kind of situation? Also, is there a way to see the size of the compressed
spill files to get a better idea of the file sizes I'm dealing with?
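
[Archive note: the usual first knobs for reducing spill counts in Hadoop 1.x are the map-side sort-buffer settings, and the combiner property is spelled min.num.spills.for.combine, not "max". The snippet below is a sketch using Hadoop 1.x property names; the values are illustrative, not tuned for this job.]

```xml
<!-- Sketch only: Hadoop 1.x property names; values are illustrative. -->
<property>
  <name>io.sort.mb</name>      <!-- in-memory sort buffer in MB; default 100 -->
  <value>512</value>
</property>
<property>
  <name>io.sort.spill.percent</name> <!-- buffer fill level that triggers a spill; default 0.80 -->
  <value>0.90</value>
</property>
<property>
  <name>io.sort.factor</name>  <!-- spill streams merged at once; default 10 -->
  <value>100</value>
</property>
<property>
  <name>min.num.spills.for.combine</name> <!-- combiner re-runs during the merge at/above this many spills; default 3 -->
  <value>3</value>
</property>
```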
2012/11/7 Harsh J <[EMAIL PROTECTED]>

> Yes we do compress each spill output using the same codec as specified
> for map (intermediate) output compression. However, the counted bytes
> may be counting decompressed values of the records written, and not
> post-compressed ones.
> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann
> <[EMAIL PROTECTED]> wrote:
> > Hi guys,
> >
> > I've encountered a situation where the ratio between "Map output bytes"
> > and "Map output materialized bytes" is quite large, and during the map
> > phase data is spilled to disk quite a lot. This is something I'll try to
> > optimize, but I'm wondering if the spill files are compressed at all. I set
> > mapred.compress.map.output=true and
> > mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> > and everything else seems to be working correctly. Does Hadoop actually
> > compress spills or just the final spill after finishing the entire
> > map-task?
> >
> > Thanks,
> > Sigurd
> --
> Harsh J
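
[Archive note: on the last question in the thread, one low-tech way to see the compressed spill sizes is to look at the tasktracker's local directories while the job runs. A sketch, assuming the Hadoop 1.x on-disk layout of spill files (named spill*.out) under mapred.local.dir; the MAPRED_LOCAL_DIR variable here is a placeholder, not a Hadoop setting.]

```shell
# Sum the on-disk (i.e. compressed, if map output compression is on) sizes
# of spill files for running map tasks on this node.
# MAPRED_LOCAL_DIR is a placeholder; point it at your mapred.local.dir.
MAPRED_LOCAL_DIR="${MAPRED_LOCAL_DIR:-.}"
find "$MAPRED_LOCAL_DIR" -name 'spill*.out' -exec du -ch {} + | tail -n 1
```

Comparing this total against the "Map output bytes" counter gives a rough on-disk compression ratio for the spills.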