

Robert Dyer 2013-02-25, 07:41
Re: Slow MR time and high network utilization with all local data
Thanks for pointing me towards short circuit!  I dug around and couldn't
find any mention in the logs of the local reader loading, and then spotted
a config error.  When I used HBase, it enabled short-circuit reads via its
configs (which were correct), but when I didn't use HBase, short circuit
was never turned on.
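
For reference, a minimal sketch of the hdfs-site.xml settings involved, assuming the legacy (Hadoop 1.x era) short-circuit implementation; that implementation also requires whitelisting the reading user, which would explain the errors seen when running jobs as another user. The user name below is hypothetical:

```xml
<!-- hdfs-site.xml: legacy short-circuit read settings (Hadoop 1.x era);
     exact property names can vary by Hadoop version -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- the legacy implementation only permits listed users to read
       block files directly; "hadoop" here is a hypothetical example -->
  <name>dfs.block.local-path-access.user</name>
  <value>hadoop</value>
</property>
```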

Now I see no network utilization for this job and it runs *much* faster (13
mins instead of 2+ hours)!  Problem solved! :-)
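
One hedged way to confirm the fix from the client side is to look for the local block reader in a task's logs. The log path and message text below are fabricated stand-ins (real messages differ by version), but `BlockReaderLocal` is the class that handles short-circuit reads in this era, so its name appearing in a task attempt's syslog is a reasonable signal:

```shell
# Stand-in for a real task log, e.g. userlogs/<attempt>/syslog
LOG=$(mktemp)
# Fabricated example line; real DFSClient messages differ by version
echo "INFO hdfs.DFSClient: New BlockReaderLocal for block blk_123" > "$LOG"
if grep -q "BlockReaderLocal" "$LOG"; then
  echo "short-circuit reader in use"
fi
```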

Thanks Harsh!
On Mon, Feb 25, 2013 at 1:41 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:

> I am using Ganglia.
>
> Note I have short-circuit reads enabled (I think; I never verified it was
> working, but I do get errors if I run jobs as another user).
>
> Also, if Ganglia's network use included the local socket then I would see
> network utilization in all cases.  I see no utilization when using HBase as
> MR input and MapFile.  I also see a small amount when using HBase for both
> (as one would expect).
>
>
> On Mon, Feb 25, 2013 at 1:22 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Hi Robert,
>>
>> How are you measuring the network usage? Note that unless short
>> circuit reading is on, data reads are done over a local socket as
>> well, and may appear in network traffic observing tools too (but do
>> not mean they are over the network).
>>
>> On Mon, Feb 25, 2013 at 2:35 AM, Robert Dyer <[EMAIL PROTECTED]> wrote:
>> > I have a small 6 node dev cluster.  I use a 1GB SequenceFile as input to a
>> > MapReduce job, using a custom split size of 10MB (to increase the number of
>> > maps).  Each map call will read random entries out of a shared MapFile
>> > (that is around 50GB).
>> >
>> > I set replication to 6 on both of these files, so all of the data should be
>> > local for each map task.  I verified via fsck that no blocks are
>> > under-replicated.
>> >
>> > Despite this, for some reason the MR job maxes out the network and takes an
>> > extremely long time.  What could be causing this?
>> >
>> > Note that the total number of map outputs for this job is around 400 and
>> > the reducer just passes the values through, so there shouldn't be much
>> > network utilized by the output.
>> >
>> > As an experiment, I switched from the SeqFile input to an HBase table and
>> > now see almost no network used.  I also tried leaving the SeqFile as input
>> > and switched the MapFile to an HBase table and see about 30% network used
>> > (which makes sense, as now that 50GB data isn't always local).
>> >
>> > What is going on here?  How can I debug to see what data is being
>> > transferred over the network?
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
>
> Robert Dyer
> [EMAIL PROTECTED]
>

--

Robert Dyer
[EMAIL PROTECTED]