

Re: Slow MR time and high network utilization with all local data
Thanks for pointing me towards short circuit!  I dug around and couldn't find
any mention in the logs of the local reader loading, and then spotted a config
error.  When I used HBase, its configs (which were correct) enabled short
circuit, but when I didn't use HBase, nothing set it.

Now I see no network utilization for this job and it runs *much* faster (13
mins instead of 2+ hours)!  Problem solved! :-)

Thanks Harsh!
On Mon, Feb 25, 2013 at 1:41 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:

> I am using Ganglia.
>
> Note I have short circuit reads enabled (I think; I never verified it was
> working, but I do get errors if I run jobs as another user).
>
> Also, if Ganglia's network use included the local socket, then I would see
> network utilization in all cases.  I see no utilization when using HBase as
> the MR input alongside the MapFile.  I also see a small amount when using
> HBase for both (as one would expect).
>
>
> On Mon, Feb 25, 2013 at 1:22 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Hi Robert,
>>
>> How are you measuring the network usage? Note that unless short-circuit
>> reading is on, data reads are done over a local socket as well, and may
>> appear in network-traffic monitoring tools too (even though they do not
>> actually go over the network).
>>
>> On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
>> > I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input to a
>> > MapReduce job, using a custom split size of 10MB (to increase the number of
>> > maps).  Each map call will read random entries out of a shared MapFile
>> > (that is around 50GB).
>> >
>> > I set replication to 6 on both of these files, so all of the data should be
>> > local for each map task.  I verified via fsck that no blocks are
>> > under-replicated.
>> >
>> > Despite this, for some reason the MR job maxes out the network and takes an
>> > extremely long time.  What could be causing this?
>> >
>> > Note that the total number of map outputs for this job is around 400 and
>> > the reducer just passes the values through, so there shouldn't be much
>> > network utilized by the output.
>> >
>> > As an experiment, I switched from the SeqFile input to an HBase table and
>> > now see almost no network used.  I also tried leaving the SeqFile as input
>> > and switched the MapFile to an HBase table and see about 30% network used
>> > (which makes sense, as now that 50GB data isn't always local).
>> >
>> > What is going on here?  How can I debug to see what data is being
>> > transferred over the network?
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
>
> Robert Dyer
> [EMAIL PROTECTED]
>

--

Robert Dyer
[EMAIL PROTECTED]
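
For readers landing on this thread, here is a rough sketch of the kind of job
setup described above (10MB max splits on the SequenceFile input, replication 6
on the shared MapFile); the paths, class names, and job name are placeholders,
not taken from the thread:

    // Sketch only: not the original code from this thread.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class SeqFileScanSetup {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Replicate the shared MapFile (data + index parts) to all 6 nodes so
            // every map task can read it locally; check afterwards with
            //   hdfs fsck /data/mapfile -files -blocks -locations
            fs.setReplication(new Path("/data/mapfile/data"), (short) 6);
            fs.setReplication(new Path("/data/mapfile/index"), (short) 6);

            Job job = Job.getInstance(conf, "seqfile-scan");
            job.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/input.seq"));
            // Cap split size at 10MB so the 1GB input yields many more map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024);
            return job;
        }
    }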