Slow MR time and high network utilization with all local data
Robert Dyer 2013-02-24, 21:05
I have a small 6 node dev cluster. I use a 1GB SequenceFile as input to a
MapReduce job, using a custom split size of 10MB (to increase the number of
maps). Each map call will read random entries out of a shared MapFile
(that is around 50GB).
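For reference, here is a rough sketch of how many map tasks this setup implies. It assumes the input is exactly 1GB, the split size is exactly 10MB, and splits are carved the way Hadoop's FileInputFormat does it, where the last split may be up to 1.1 times the split size. The function name `count_splits` is made up for illustration, and SequenceFile sync-marker alignment is ignored.

```python
MB = 1024 * 1024
# Assumption: mirrors the 1.1 "split slop" factor used by Hadoop's FileInputFormat.
SPLIT_SLOP = 1.1


def count_splits(file_bytes: int, split_bytes: int, slop: float = SPLIT_SLOP) -> int:
    """Estimate the number of input splits (one map task each) for one file."""
    if file_bytes <= 0:
        return 0
    splits = 0
    remaining = file_bytes
    # Carve off full-size splits while the remainder exceeds slop * split size.
    while remaining / split_bytes > slop:
        splits += 1
        remaining -= split_bytes
    # Whatever is left becomes the final (possibly smaller or slightly larger) split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # 1GB input with 10MB splits, as described above.
    print(count_splits(1024 * MB, 10 * MB))  # prints 103
```

Under those assumptions the job runs roughly a hundred map tasks, each of which opens the shared MapFile, so the number of MapFile readers grows with the split count rather than with the node count.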
I set replication to 6 on both of these files, so all of the data should be
local for each map task. I verified via fsck that no blocks are
under-replicated.
Despite this, for some reason the MR job maxes out the network and takes an
extremely long time. What could be causing this?
Note that the total number of map outputs for this job is around 400 and
the reducer just passes the values through, so there shouldn't be much
network utilized by the output.
As an experiment, I switched from the SeqFile input to an HBase table and
now see almost no network used. I also tried leaving the SeqFile as input
and switching the MapFile to an HBase table, and saw about 30% network
utilization (which makes sense, since that 50GB of data is no longer always
local).
What is going on here? How can I debug to see what data is being
transferred over the network?