Slow MR time and high network utilization with all local data
I have a small 6-node dev cluster.  I use a 1GB SequenceFile as input to a
MapReduce job, with a custom split size of 10MB (to increase the number of
maps).  Each map call reads random entries out of a shared MapFile
(which is around 50GB).
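
For a rough sense of how many map tasks that split size produces, here is a minimal sketch in plain Java (no Hadoop dependency). It assumes the input is exactly 1 GiB and the split size exactly 10 MiB, and it imitates the 10% slack that Hadoop's FileInputFormat allows on the last split. The class and method names (SplitEstimate, estimateSplits) are made up for illustration, and the real count depends on the Hadoop version and the job configuration.

```java
public class SplitEstimate {
    // FileInputFormat lets the final split run up to 10% over the target size.
    private static final double SPLIT_SLOP = 1.1;

    // Hypothetical helper: estimates the number of input splits for one file.
    // Simplification: returns 0 for an empty file.
    static long estimateSplits(long fileBytes, long splitBytes) {
        if (fileBytes <= 0 || splitBytes <= 0) {
            return 0;
        }
        long splits = 0;
        long remaining = fileBytes;
        // Carve off full-size splits while more than 1.1 splits' worth remains.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // Whatever is left becomes the last (possibly smaller or larger) split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed sizes: 1 GiB input, 10 MiB splits.
        System.out.println(estimateSplits(1024 * mb, 10 * mb));
    }
}
```

Under those assumptions the job gets on the order of 100 map tasks, each of which opens the shared MapFile for its random lookups.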

I set replication to 6 on both of these files, so all of the data should be
local for each map task.  I verified via fsck that no blocks are

Despite this, for some reason the MR job maxes out the network and takes an
extremely long time.  What could be causing this?

Note that the total number of map outputs for this job is around 400 and
the reducer just passes the values through, so there shouldn't be much
network utilized by the output.

As an experiment, I switched from the SequenceFile input to an HBase table and
now see almost no network used.  I also tried leaving the SequenceFile as input
and switching the MapFile to an HBase table, and I see about 30% network used
(which makes sense, as that 50GB of data is no longer always local).

What is going on here?  How can I debug to see what data is being
transferred over the network?