Re: Spill file compression
When I log the calls to the combiner function and print the number of
elements iterated over per call, it is always 1 during the spill-writing
phase, and the combiner is called very often. Is this normal behavior?
According to what was mentioned earlier, I would expect the combiner to
combine all records with the same key that are in the in-memory buffer
before the spill, which should be at least a few per spill in my case. This
is confusing...
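For reference, a minimal sketch of the kind of instrumented combiner described above (the class name, the Text/IntWritable types, and the summing logic are assumptions, not the actual code from this job):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountingCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable v : values) {
      sum += v.get();
      count++;
    }
    // When the combiner runs per spill, 'count' only covers the records for this key
    // that were sitting in the in-memory buffer at spill time (stderr ends up in the task log).
    System.err.println("combiner call for key " + key + ": " + count + " value(s)");
    context.write(key, new IntWritable(sum));
  }
}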
2012/11/7 Sigurd Spieckermann <[EMAIL PROTECTED]>

> Hm, maybe I need some clarification on what the combiner exactly does.
> From what I understand from "Hadoop - The Definitive Guide", there are a
> few occasions when a combiner may be called before the sort&shuffle phase.
>
> 1) Once the in-memory buffer reaches the threshold it will spill out to
> disk. "Before it writes to disk, the thread first divides the data into
> partitions corresponding to the reducers that they will ultimately be sent
> to. Within each partition, the background thread performs an in-memory sort
> by key, and if there is a combiner function, it is run on the output of the
> sort. Running the combiner function makes for a more compact map output, so
> there is less data to write to local disk and to transfer to the reducer."
> So to me, this means that the combiner at this point only operates on the
> data that is located in the in-memory buffer. If the buffer can hold at
> most n records with k distinct keys (uniformly distributed), then the
> combiner will reduce the number of records spilled to disk from n to k,
> i.e. by a factor of n/k (illustrated with numbers below). (correct?)
>
> 2) "Before the task is finished, the spill files are merged into a single
> partitioned and sorted output file. [...] If there are at least three spill
> files (set by the min.num.spills.for.combine property) then the combiner is
> run again before the output file is written." So the number of spill files
> is not affected by the use of a combiner; only their sizes are usually
> reduced, and only at the end of the map task are all spill files touched
> again, merged, and combined. If I have k distinct keys per map task, then I
> will be guaranteed to have k records at the very end of the map task.
> (correct?)
>
> Is there any other occasion when the combiner may be called? Are spill
> files ever touched again before the final merge?
>
> Thanks,
> Sigurd
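For concreteness, the reduction described in point 1 above with illustrative (made-up) numbers: if a spill covers n = 1,000,000 buffered records spread uniformly over k = 100 distinct keys, the combiner should emit about k = 100 records for that spill (spread across the reduce partitions), i.e. a reduction from n to k records, or a factor of n/k = 10,000. Below is a hedged sketch of a driver fragment showing where the combiner and the min.num.spills.for.combine property from point 2 are configured; all class names other than Job/Configuration are placeholders, not the actual job from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemoDriver {
  // Hedged Hadoop 1.x-era driver fragment, for illustration only.
  public static Job buildJob(Configuration conf) throws Exception {
    // Run the combiner again during the final spill merge only if there are at
    // least this many spill files; 3 is also the default.
    conf.setInt("min.num.spills.for.combine", 3);
    Job job = new Job(conf, "combiner-demo");        // Job.getInstance(conf) in newer APIs
    job.setCombinerClass(CountingCombiner.class);    // e.g. the instrumented combiner sketched earlier
    // mapper, reducer, input/output paths etc. omitted
    return job;
  }
}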
>
>
>
> 2012/11/7 Sigurd Spieckermann <[EMAIL PROTECTED]>
>
>> OK, I found the answer to one of my questions just now -- the location of
>> the spill files and their sizes. So, there's a discrepancy between what I
>> see and what you said about the compression. The total size of all spill
>> files of a single task matches what I would estimate them to be
>> *without* compression. It seems they aren't compressed, but that's strange
>> because I definitely enabled compression the way I described.
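The exact settings aren't shown in this excerpt; for reference, one common way intermediate (map output / spill) compression was enabled with the pre-2.x property names of this thread's timeframe looks like the following sketch (the codec is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class MapOutputCompression {
  // Enables compression of intermediate map output, which is what the spill files contain.
  public static Configuration enable(Configuration conf) {
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  DefaultCodec.class, CompressionCodec.class);  // or SnappyCodec/LzoCodec if available
    return conf;
  }
}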
>>
>>
>>
>> 2012/11/7 Sigurd Spieckermann <[EMAIL PROTECTED]>
>>
>>> OK, just wanted to confirm. Maybe there is another problem then. I just
>>> looked at the task logs and there were ~200 spills recorded for a single
>>> task, and only afterwards was there a merge phase. In my case, 200 spills are
>>> about 2GB (uncompressed). One map output record easily fits into the
>>> in-memory buffer, in fact, a few records fit into it. But Hadoop decides to
>>> write gigabytes of spill to disk and it seems that the disk I/O and merging
>>> make everything really slow. There doesn't seem to be a
>>> max.num.spills.for.combine though. Is there any typical advice for this
>>> kind of situation? Also, is there a way to see the size of the compressed
>>> spill files to get a better idea about the file sizes I'm dealing with?
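One rough way to gauge how much actually hits local disk after compression is to compare the job-level "Map output bytes" counter (pre-compression record bytes) with the local-filesystem FILE_BYTES_WRITTEN counter, which includes the compressed spill and merge output summed over all tasks. A hedged sketch follows; the counter group and counter names are the Hadoop 1.x-era internal names and can differ between versions:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SpillSizeCheck {
  // Runs the job and prints the two counters side by side for a rough comparison.
  public static void report(JobConf conf) throws Exception {
    RunningJob job = JobClient.runJob(conf);
    Counters c = job.getCounters();
    long mapOutputBytes = c.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_BYTES").getValue();
    long localBytesWritten = c.findCounter(
        "FileSystemCounters", "FILE_BYTES_WRITTEN").getValue();
    System.out.println("Map output bytes (pre-compression):  " + mapOutputBytes);
    System.out.println("Local FS bytes written (compressed): " + localBytesWritten);
  }
}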
>>>
>>>
>>>
>>> 2012/11/7 Harsh J <[EMAIL PROTECTED]>
>>>
>>>> Yes, we do compress each spill output using the same codec as specified
>>>> for map (intermediate) output compression. However, the counted bytes
>>>> may reflect the decompressed size of the records written, not the
>>>> post-compression size.
>>