Hadoop: The Definitive Guide says that reduce tasks will start only when all
maps have finished their work. It also says:
>> The shuffle and sort phases occur simultaneously; while map-outputs are
being fetched they are merged.
What I have understood is that when a reduce task starts, all the data it
needs (each key and its associated values) has already been transferred to its
local node. Am I right? If so, then the node running the reduce task
must have enough storage to hold all values associated with that key, or
the job will fail.
If not, then the reduce task starts with whatever data is available, and the
shuffle and sort phases feed the reduce task continuously, so low storage on
the node does not cause a problem because data arrives on demand.
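To make the second case concrete, here is a minimal sketch (plain Python, not Hadoop code; the runs and values are made up for illustration) of how sorted map-output runs could be merged lazily, so a reducer consumes each key's values as a stream instead of holding everything in memory at once:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical sorted "map output" runs of (key, value) pairs,
# standing in for partitions fetched from different map tasks.
run_a = [("apple", 1), ("banana", 1), ("cherry", 1)]
run_b = [("apple", 1), ("banana", 1)]
run_c = [("banana", 1), ("cherry", 1)]

# heapq.merge combines the sorted runs lazily: pairs are pulled
# on demand, so the full dataset never sits in memory at once.
merged = heapq.merge(run_a, run_b, run_c, key=itemgetter(0))

# groupby hands the reducer one key at a time with an iterator
# over its values -- analogous to reduce(key, values).
counts = {key: sum(v for _, v in group)
          for key, group in groupby(merged, key=itemgetter(0))}

print(counts)  # {'apple': 2, 'banana': 3, 'cherry': 2}
```

This only illustrates the streaming-merge idea; real Hadoop merges happen across memory and disk spills on the reducer node.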
Which of the two cases actually happens?