Re: Slow MR time and high network utilization with all local data
Hi Robert,

How are you measuring the network usage? Note that unless
short-circuit reads are enabled, even local block reads go through the
DataNode over a local socket. That traffic may show up in network
monitoring tools even though the data never leaves the node.
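
One rough way to tell the two apart on a Linux node is to compare the
byte counters of the loopback interface with those of the physical
interface while the job runs. The sketch below is only an illustration
and rests on assumptions: the nodes run Linux and expose /proc/net/dev,
same-host socket traffic is counted under "lo", and the class name
NetDevCounters and the 5 second sampling window are made up for this
example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: samples /proc/net/dev twice
// and prints how many bytes each interface moved in between.
class NetDevCounters {

    // Parses /proc/net/dev-style lines into: interface -> {rxBytes, txBytes}.
    public static Map<String, long[]> parse(List<String> lines) {
        Map<String, long[]> counters = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // the two header lines carry no "iface:" prefix
            }
            String iface = line.substring(0, colon).trim();
            String[] fields = line.substring(colon + 1).trim().split("\\s+");
            if (fields.length < 9) {
                continue; // not a counter line we understand
            }
            // Field 0 is received bytes, field 8 is transmitted bytes.
            long rx = Long.parseLong(fields[0]);
            long tx = Long.parseLong(fields[8]);
            counters.put(iface, new long[] {rx, tx});
        }
        return counters;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Map<String, long[]> before = parse(Files.readAllLines(Paths.get("/proc/net/dev")));
        Thread.sleep(5000); // assumed sampling window; adjust as needed
        Map<String, long[]> after = parse(Files.readAllLines(Paths.get("/proc/net/dev")));

        for (Map.Entry<String, long[]> e : after.entrySet()) {
            long[] b = before.get(e.getKey());
            if (b == null) {
                continue; // interface appeared during the window
            }
            long[] a = e.getValue();
            System.out.printf("%-10s rx=%d bytes tx=%d bytes%n",
                    e.getKey(), a[0] - b[0], a[1] - b[1]);
        }
    }
}
```

If most of the bytes during the map phase show up under "lo" rather
than the physical interface, the reads are probably local socket
traffic and not real cross-node transfers.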

On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
> I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input to a
> MapReduce job, using a custom split size of 10MB (to increase the number of
> maps).  Each map call will read random entries out of a shared MapFile (that
> is around 50GB).
>
> I set replication to 6 on both of these files, so all of the data should be
> local for each map task.  I verified via fsck that no blocks are
> under-replicated.
>
> Despite this, for some reason the MR job maxes out the network and takes an
> extremely long time.  What could be causing this?
>
> Note that the total number of map outputs for this job is around 400 and the
> reducer just passes the values through, so there shouldn't be much network
> utilized by the output.
>
> As an experiment, I switched from the SeqFile input to an HBase table and
> now see almost no network used.  I also tried leaving the SeqFile as input
> and switched the MapFile to an HBase table and see about 30% network used
> (which makes sense, as now that 50GB data isn't always local).
>
> What is going on here?  How can I debug to see what data is being
> transferred over the network?

--
Harsh J