Hadoop, mail # user - Re: Slow MR time and high network utilization with all local data

Re: Slow MR time and high network utilization with all local data
Harsh J 2013-02-25, 07:22
Hi Robert,

How are you measuring the network usage? Note that unless
short-circuit reads are enabled, local block reads still go through the
DataNode over a local socket, so they may show up in network traffic
monitoring tools even though the data never leaves the machine.
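
One rough way to check this is to compare the loopback interface with
the real NICs while the job runs. The sketch below is only an
illustration and rests on assumptions: a Linux host, per-interface
counters in /proc/net/dev, and a loopback interface named "lo" (traffic
to a local address is normally counted there). The function names are
made up for this example and are not part of Hadoop.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 9:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        stats[name.strip()] = (int(values[0]), int(values[8]))
    return stats


def split_traffic(before, after, loopback="lo"):
    """Return (loopback_bytes, other_bytes) moved between two samples.

    Loopback counts every byte twice (once as tx, once as rx), so read
    the first number as an upper bound rather than an exact figure.
    """
    local = remote = 0
    for name, (rx, tx) in after.items():
        rx0, tx0 = before.get(name, (0, 0))
        delta = (rx - rx0) + (tx - tx0)
        if name == loopback:
            local += delta
        else:
            remote += delta
    return local, remote


def sample(path="/proc/net/dev"):
    """Read the current counters on this host (assumes Linux)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

Call sample() on a worker node before and after the job and pass both
results to split_traffic(). If most of the bytes land on loopback, the
"network" load is local socket reads rather than data crossing the wire.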

On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
> I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input to a
> MapReduce job, using a custom split size of 10MB (to increase the number of
> maps).  Each map call will read random entries out of a shared MapFile (that
> is around 50GB).
>
> I set replication to 6 on both of these files, so all of the data should be
> local for each map task.  I verified via fsck that no blocks are
> under-replicated.
>
> Despite this, for some reason the MR job maxes out the network and takes an
> extremely long time.  What could be causing this?
>
> Note that the total number of map outputs for this job is around 400 and the
> reducer just passes the values through, so there shouldn't be much network
> utilized by the output.
>
> As an experiment, I switched from the SeqFile input to an HBase table and
> now see almost no network used.  I also tried leaving the SeqFile as input
> and switched the MapFile to an HBase table and see about 30% network used
> (which makes sense, as now that 50GB data isn't always local).
>
> What is going on here?  How can I debug to see what data is being
> transferred over the network?

--
Harsh J