Re: Spill file compression
OK, I found the answer to one of my questions just now -- the location of
the spill files and their sizes. So there is a discrepancy between what I
see and what you said about compression: the total size of all spill files
for a single task matches my estimate of their *uncompressed* size. It
seems they are not compressed, which is strange because I definitely
enabled compression the way I described.
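
For reference, here is a minimal sketch of the settings I am referring to,
as they would be applied from a job driver using the old (mapred.*) API;
the driver class name is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

// Placeholder driver: enables map (intermediate) output compression with
// Snappy, using the exact properties set on the job in question.
public class SpillCompressionExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(new Configuration(), SpillCompressionExample.class);

    // The properties as set on the job:
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");

    // Equivalent typed setters on JobConf:
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(SnappyCodec.class);
  }
}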
2012/11/7 Sigurd Spieckermann <[EMAIL PROTECTED]>

> OK, just wanted to confirm. Maybe there is another problem then. I just
> looked at the task logs and there were ~200 spills recorded for a single
> task, and only afterwards was there a merge phase. In my case, 200 spills
> amount to about 2 GB (uncompressed). One map output record easily fits
> into the in-memory buffer; in fact, a few records fit into it. But Hadoop
> decides to write gigabytes of spill to disk, and it seems that the disk
> I/O and merging make everything really slow. There doesn't seem to be a
> max.num.spills.for.combine though. Is there any typical advice for this
> kind of situation? Also, is there a way to see the size of the compressed
> spill files to get a better idea of the file sizes I'm dealing with?
>
>
>
> 2012/11/7 Harsh J <[EMAIL PROTECTED]>
>
>> Yes, we do compress each spill output using the same codec as specified
>> for map (intermediate) output compression. However, the counted bytes
>> may reflect the decompressed size of the records written, not the
>> post-compression size.
>>
>> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann
>> <[EMAIL PROTECTED]> wrote:
>> > Hi guys,
>> >
>> > I've encountered a situation where the ratio between "Map output bytes"
>> > and "Map output materialized bytes" is quite large, and a lot of data is
>> > spilled to disk during the map phase. This is something I'll try to
>> > optimize, but I'm wondering whether the spill files are compressed at
>> > all. I set mapred.compress.map.output=true and
>> > mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec,
>> > and everything else seems to be working correctly. Does Hadoop actually
>> > compress spills, or just the final spill after the entire map task
>> > finishes?
>> >
>> > Thanks,
>> > Sigurd
>>
>>
>>
>> --
>> Harsh J
>>
>
>
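
A note on the ~200-spills-per-task situation described above: this is not
from the thread itself, but the usual knobs for reducing the number of
spills with the old (mapred.*) API are the in-memory sort buffer and its
spill threshold. A sketch, with illustrative values that would need to be
adapted to the actual job:

import org.apache.hadoop.mapred.JobConf;

// Illustrative spill-tuning sketch for the Hadoop 1.x (mapred.*) properties
// mentioned in this thread; all values below are placeholders, not
// recommendations.
public class SpillTuningSketch {
  static void tune(JobConf conf) {
    // Larger in-memory buffer for map output before a spill is triggered (MB).
    conf.setInt("io.sort.mb", 512);

    // Spill only when the buffer is this full.
    conf.setFloat("io.sort.spill.percent", 0.90f);

    // Merge more spill segments per pass during the on-disk merge.
    conf.setInt("io.sort.factor", 100);

    // Run the combiner during the merge once at least this many spills
    // exist (note: this is min.num.spills.for.combine; no max.* variant
    // exists, per the question above).
    conf.setInt("min.num.spills.for.combine", 3);
  }
}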