I am currently pushing the limits of my Hadoop cluster. Unfortunately, I am
a bit confused about the memory requirements during the copy and sort
phases. I have made an effort to fully explain my setup and problems
at the following link:
(I provide this link because the formatting makes it easier to read than a
In short my questions are:
-Does the copy phase require that my map output data is in two places at
the same time (source and destination)?
-Is my understanding of map output flow correct? Map -> spill -> sort ->
compress -> local disk, and later: decompress + merge/shuffle.
-How does the reduce-side sorting work? Are 10 map output files at a time
decompressed and then merged? How many merge rounds are to be expected
(since it keeps merging until only one file is left, the decompression
might pose big problems here!)?
-Given my setup with
  -30 slaves with hard disks of 100 GB
  -1800 map tasks which generate 8 GB map output bytes and 1 GB map output
   materialized bytes
  -100 reduce tasks which have 18 GB input and 5 GB output
Is it possible to fit everything in this cluster by playing with the
environment settings or is it just too much for this cluster, and can you
help me estimate my memory requirements based on this information?
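
To make the numbers concrete, here is a rough back-of-envelope sketch in Python. It rests on assumptions that may not hold: I treat the figures above as per-task values, I assume a merge factor of 10 (the "10 files at a time" from my question), and I model the merge naively as repeated rounds that each combine up to 10 segments. The function names are made up for illustration and are not part of Hadoop.

```python
import math

# Figures from the setup above. Reading them as PER-TASK values is an assumption.
SLAVES = 30
DISK_PER_SLAVE_GB = 100
MAP_TASKS = 1800
MAP_OUTPUT_GB = 8            # uncompressed map output per map task (assumed per task)
MAP_MATERIALIZED_GB = 1      # compressed output on local disk per map task (assumed per task)
REDUCE_TASKS = 100
REDUCE_INPUT_GB = 18
REDUCE_OUTPUT_GB = 5
MERGE_FACTOR = 10            # assumed number of segments merged at a time


def cluster_totals():
    """Cluster-wide totals in GB under the per-task assumption."""
    return {
        "disk_gb": SLAVES * DISK_PER_SLAVE_GB,
        "map_raw_gb": MAP_TASKS * MAP_OUTPUT_GB,
        "map_materialized_gb": MAP_TASKS * MAP_MATERIALIZED_GB,
        "reduce_input_gb": REDUCE_TASKS * REDUCE_INPUT_GB,
        "reduce_output_gb": REDUCE_TASKS * REDUCE_OUTPUT_GB,
    }


def naive_merge_passes(segments, factor):
    """Rounds needed if every round merges up to `factor` segments into one.

    This is a simplified model, not a description of Hadoop's actual merger.
    """
    passes = 0
    while segments > 1:
        segments = math.ceil(segments / factor)
        passes += 1
    return passes


if __name__ == "__main__":
    totals = cluster_totals()
    for name, value in totals.items():
        print(f"{name}: {value}")
    # Hypothetical worst case: compressed map output kept at source AND destination.
    duplicated = 2 * totals["map_materialized_gb"]
    print("duplicated map output fits on disk:", duplicated <= totals["disk_gb"])
    # Assumes each reducer receives one segment from every map task.
    print("naive merge passes per reducer:", naive_merge_passes(MAP_TASKS, MERGE_FACTOR))
```

Under these assumptions the compressed map output (1800 GB) matches the total reduce input (100 x 18 GB), the cluster has 3000 GB of disk in total, and keeping two copies of the compressed map output (3600 GB) would not fit.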
Thanks a lot for any help!
Dieter De Witte
PhD Student @ Ghent University