MapReduce, mail # user - Help tuning a cluster - COPY slow


Tim Robertson 2010-11-17, 08:43
Friso van Vollenhoven 2010-11-17, 09:20
Tim Robertson 2010-11-17, 20:50
Aaron Kimball 2010-11-17, 21:08
Tim Robertson 2010-11-17, 21:20
Friso van Vollenhoven 2010-11-18, 09:19
Tim Robertson 2010-11-18, 10:04

Re: Help tuning a cluster - COPY slow
Tim Robertson 2010-11-18, 13:38
Just to close this thread.
Turns out it all came down to mapred.reduce.parallel.copies being
overridden to 5 on the Hive submission. Cranking that back up made
everything happy again.
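
For anyone hitting the same thing, here is a rough sketch of how one might spot this kind of override: compare the cluster-side configuration with the configuration a job was actually submitted with (for example the job.xml recorded for the job). The helper names load_props and find_overrides, the inline XML snippets, and the cluster value of 20 are all made up for illustration; only the property name and the overridden value of 5 come from this thread.

```python
import xml.etree.ElementTree as ET


def load_props(xml_text):
    """Return {name: value} from a Hadoop-style <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def find_overrides(cluster_xml, job_xml, keys):
    """Return {key: (cluster_value, job_value)} for keys whose values differ."""
    cluster = load_props(cluster_xml)
    job = load_props(job_xml)
    diffs = {}
    for key in keys:
        if cluster.get(key) != job.get(key):
            diffs[key] = (cluster.get(key), job.get(key))
    return diffs


# Hypothetical cluster setting (the value 20 is an assumption, not from the thread).
CLUSTER_XML = """<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
</configuration>"""

# What the submitted job ended up with (5, as reported in the thread).
JOB_XML = """<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
  </property>
</configuration>"""

overrides = find_overrides(
    CLUSTER_XML, JOB_XML, ["mapred.reduce.parallel.copies"]
)
for key, (cluster_value, job_value) in overrides.items():
    print(f"{key}: cluster={cluster_value} job={job_value}")
```

If the value does turn out to be overridden at submission time, it can usually be set explicitly per Hive session with a statement like SET mapred.reduce.parallel.copies=<n>; where the right n depends on the cluster.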

Thanks for the ideas,

Tim
On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson
<[EMAIL PROTECTED]> wrote:
> Thanks again.
>
> We are getting closer to debugging this.  Our reference for all these
> tests was a simple GroupBy using Hive, but when I do a vanilla MR job
> on the tab file input to do the same group by, it flies through -
> almost exactly 2 times quicker.  Investigating further as it is not
> quite a fair test at the moment due to some config differences...
>
>
> On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
> <[EMAIL PROTECTED]> wrote:
>> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 results for lookups, Java will try v6 first and then fall back to v4, which is an additional connect attempt. You can force Java to use only v4 by setting the system property java.net.preferIPv4Stack=true.
>>
>> Also, I am not sure whether Java does the same thing as nslookup when doing name lookups (I believe it has its own cache as well, but correct me if I'm wrong).
>>
>> You could try running something like strace (with the -T option, which shows time spent in system calls) to see whether network related system calls take a long time.
>>
>>
>>
>> Friso
>>
>>
>>
>>
>> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>>
>>> I don't think so Aaron - but we use names not IPs in the config and on
>>> a node the following is instant:
>>>
>>> [root@c2n1 ~]# nslookup c1n1.gbif.org
>>> Server:               130.226.238.254
>>> Address:      130.226.238.254#53
>>>
>>> Non-authoritative answer:
>>> Name: c1n1.gbif.org
>>> Address: 130.226.238.171
>>>
>>> If I ssh onto an arbitrary machine in the cluster and pull a file
>>> using curl (e.g.
>>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>>> it comes down at 110M/s with no delay on DNS lookup.
>>>
>>> Is there a better test I can do? - I am not so much a network guy...
>>> Cheers,
>>> Tim
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:
>>>> Tim,
>>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>>> /etc/hosts, or other network-config gotchas that might be preventing network
>>>> connections between hosts from opening efficiently?
>>>> - Aaron
>>>>
>>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> Thanks Friso,
>>>>>
>>>>> We've been trying to diagnose all day and still have not found a solution.
>>>>> We're running Cacti and IO wait is down at 0.5%, map and reduce slots are
>>>>> tuned right down to 1 mapper and 1 reducer on each machine, and the
>>>>> machine CPUs are almost idle with no swap.
>>>>> Using curl to pull a file from a DN comes down at 110M/s.
>>>>>
>>>>> We are now upping things like epoll
>>>>>
>>>>> Any ideas are greatly appreciated at this stage!
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>> Hi Tim,
>>>>>> Getting 28K of map outputs to reducers should not take minutes. Reducers
>>>>>> on a properly set up (1Gb) network should be copying at multiple MB/s. I
>>>>>> think you need to get some more info.
>>>>>> Apart from top, you'll probably also want to look at iostat and vmstat.
>>>>>> The first will tell you something about disk utilization and the latter
>>>>>> can tell you whether the machines are using swap or not. This is very
>>>>>> important. If you are overutilizing physical memory on the machines,
>>>>>> things will be slow.
>>>>>> It's even better if you put something in place that allows you to get an
>>>>>> overall view of the resource usage across the cluster. Look at Ganglia
>>>>>> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or