MapReduce, mail # user - Slow MR time and high network utilization with all local data


Re: Slow MR time and high network utilization with all local data
Robert Dyer 2013-02-25, 07:41
I am using Ganglia.

Note that I have short-circuit reads enabled (I think; I never verified that
they work, but I do get errors if I run jobs as another user).
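
One way to at least confirm that short-circuit reads are requested in the
client configuration is to inspect hdfs-site.xml directly. The following is
a minimal sketch, not a definitive check: it only reads the settings and
does not prove that reads actually bypass the DataNode socket. It assumes
the property names dfs.client.read.shortcircuit and
dfs.block.local-path-access.user (which of them applies depends on the
Hadoop version), and the XML contents and the user name "hadoopuser" are
hypothetical.

```python
import xml.etree.ElementTree as ET

# HDFS property names; which ones apply depends on the Hadoop version.
SHORT_CIRCUIT_KEY = "dfs.client.read.shortcircuit"
LEGACY_USER_KEY = "dfs.block.local-path-access.user"


def read_hadoop_properties(xml_text):
    """Parse the text of a Hadoop *-site.xml file into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def short_circuit_summary(xml_text):
    """Report whether short-circuit reads are requested, and for which users."""
    props = read_hadoop_properties(xml_text)
    enabled = props.get(SHORT_CIRCUIT_KEY, "false").lower() == "true"
    raw_users = props.get(LEGACY_USER_KEY, "")
    users = [u.strip() for u in raw_users.split(",") if u.strip()]
    return {"enabled": enabled, "allowed_users": users}


# Hypothetical configuration contents, for illustration only.
EXAMPLE_HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.block.local-path-access.user</name>
    <value>hadoopuser</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(short_circuit_summary(EXAMPLE_HDFS_SITE))
```

If the allowed-user list does not include the account that runs the job,
that would be consistent with errors appearing only for other users, but
this is an inference from the configuration, not a confirmed diagnosis.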

Also, if Ganglia's network figures included local socket traffic, I would
see network utilization in all cases.  I see none when using an HBase table
as the MR input together with the MapFile, and a small amount when using
HBase for both (as one would expect).
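
To check independently of any monitoring tool whether traffic is local or
really crosses the network, one option on Linux is to sample the
per-interface byte counters in /proc/net/dev before and after a job and
separate the loopback interface ("lo") from the rest. This is a rough
sketch under that assumption; the sample counter values and the interface
name eth0 are made up for illustration.

```python
def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter row
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def split_loopback(before, after):
    """Bytes moved between two samples, split into loopback and external."""
    totals = {"loopback": 0, "external": 0}
    for iface, (rx_after, tx_after) in after.items():
        rx_before, tx_before = before.get(iface, (0, 0))
        moved = (rx_after - rx_before) + (tx_after - tx_before)
        totals["loopback" if iface == "lo" else "external"] += moved
    return totals


# Made-up samples in the /proc/net/dev layout, for illustration only.
HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_BEFORE = HEADER + (
    "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
    "  eth0: 5000 50 0 0 0 0 0 0 3000 30 0 0 0 0 0 0\n"
)
SAMPLE_AFTER = HEADER + (
    "    lo: 4000 40 0 0 0 0 0 0 4000 40 0 0 0 0 0 0\n"
    "  eth0: 5500 55 0 0 0 0 0 0 3200 32 0 0 0 0 0 0\n"
)

if __name__ == "__main__":
    before = parse_proc_net_dev(SAMPLE_BEFORE)
    after = parse_proc_net_dev(SAMPLE_AFTER)
    print(split_loopback(before, after))
```

On a real node the two samples would come from reading /proc/net/dev just
before and just after the job. A large "external" delta on every node would
suggest that data really is leaving the machine.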
On Mon, Feb 25, 2013 at 1:22 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Robert,
>
> How are you measuring the network usage? Note that unless short
> circuit reading is on, data reads are done over a local socket as
> well, and may appear in network traffic observing tools too (but do
> not mean they are over the network).
>
> On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
> > I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input to a
> > MapReduce job, using a custom split size of 10MB (to increase the number of
> > maps).  Each map call will read random entries out of a shared MapFile (that
> > is around 50GB).
> >
> > I set replication to 6 on both of these files, so all of the data should be
> > local for each map task.  I verified via fsck that no blocks are
> > under-replicated.
> >
> > Despite this, for some reason the MR job maxes out the network and takes an
> > extremely long time.  What could be causing this?
> >
> > Note that the total number of map outputs for this job is around 400 and the
> > reducer just passes the values through, so there shouldn't be much network
> > utilized by the output.
> >
> > As an experiment, I switched from the SeqFile input to an HBase table and
> > now see almost no network used.  I also tried leaving the SeqFile as input
> > and switched the MapFile to an HBase table and see about 30% network used
> > (which makes sense, as now that 50GB data isn't always local).
> >
> > What is going on here?  How can I debug to see what data is being
> > transferred over the network?
>
>
>
> --
> Harsh J
>

--

Robert Dyer
[EMAIL PROTECTED]