MapReduce >> mail # user >> Slow MR time and high network utilization with all local data


Re: Slow MR time and high network utilization with all local data
I am using Ganglia.

Note that I have short-circuit reads enabled (I think; I never verified that
they work, but I do get errors if I run jobs as another user).
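
One way to at least confirm the client-side setting is to read it out of the configuration file. The following is a minimal sketch, assuming the option is set in hdfs-site.xml under the property name dfs.client.read.shortcircuit. The helper names and the sample XML are hypothetical, and a "true" value only shows that the option is set, not that reads actually bypass the DataNode.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def short_circuit_enabled(props):
    # Assumption for this sketch: a missing property counts as disabled.
    value = props.get("dfs.client.read.shortcircuit", "false")
    return value.lower() == "true"


# Hypothetical sample; in practice, pass in the text of your own hdfs-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
</configuration>"""

props = read_hadoop_properties(SAMPLE)
print("short-circuit reads configured:", short_circuit_enabled(props))
```

Running the same check against the configuration the job's user actually sees can help rule out a mismatch between clients.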

Also, if Ganglia's network figures included local socket traffic, I would
see network utilization in all cases.  I see none when using HBase as the
MR input together with the MapFile.  I also see a small amount when using
HBase for both (as one would expect).
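
To separate loopback traffic from traffic on a real interface independently of Ganglia, one option is to sample the kernel's per-interface byte counters before and after a job. The following is a minimal sketch, assuming Linux nodes that expose /proc/net/dev. The function names and the sample counter values are made up for illustration.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def traffic_delta(before, after):
    """Bytes received and sent per interface between two samples."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }


# Made-up samples; on a node, read the text of /proc/net/dev twice instead.
HEADER = (
    "Inter-|   Receive                      |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
BEFORE = HEADER + (
    "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
    "  eth0: 5000 50 0 0 0 0 0 0 7000 70 0 0 0 0 0 0\n"
)
AFTER = HEADER + (
    "    lo: 9000 90 0 0 0 0 0 0 9000 90 0 0 0 0 0 0\n"
    "  eth0: 5200 52 0 0 0 0 0 0 7100 71 0 0 0 0 0 0\n"
)

delta = traffic_delta(parse_net_dev(BEFORE), parse_net_dev(AFTER))
for iface, (rx, tx) in sorted(delta.items()):
    print(iface, "rx bytes:", rx, "tx bytes:", tx)
```

A large delta on lo with a small one on the physical interface would point at local socket reads rather than traffic between nodes.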
On Mon, Feb 25, 2013 at 1:22 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Robert,
>
> How are you measuring the network usage? Note that unless short
> circuit reading is on, data reads are done over a local socket as
> well, and may appear in network traffic observing tools too (but do
> not mean they are over the network).
>
> On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
> > I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input
> > to a MapReduce job, using a custom split size of 10MB (to increase the
> > number of maps).  Each map call will read random entries out of a
> > shared MapFile (that is around 50GB).
> >
> > I set replication to 6 on both of these files, so all of the data
> > should be local for each map task.  I verified via fsck that no blocks
> > are under-replicated.
> >
> > Despite this, for some reason the MR job maxes out the network and
> > takes an extremely long time.  What could be causing this?
> >
> > Note that the total number of map outputs for this job is around 400
> > and the reducer just passes the values through, so there shouldn't be
> > much network utilized by the output.
> >
> > As an experiment, I switched from the SeqFile input to an HBase table
> > and now see almost no network used.  I also tried leaving the SeqFile
> > as input and switched the MapFile to an HBase table and see about 30%
> > network used (which makes sense, as now that 50GB data isn't always
> > local).
> >
> > What is going on here?  How can I debug to see what data is being
> > transferred over the network?
>
>
>
> --
> Harsh J
>

--

Robert Dyer
[EMAIL PROTECTED]