Re: DataNode and TaskTracker communication
Michael Segel 2012-08-13, 14:59
0.0.0.0 means that the call is going to all interfaces on the machine.  (Shouldn't be an issue...)
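To confirm what the Hadoop daemons are actually bound to, something along these lines works on a Linux box (assumes net-tools is installed; run as root so the owning process is shown):

    # list listening TCP sockets and the processes that own them
    netstat -tlnp | grep java

A local address of 0.0.0.0 in that output just means the daemon is listening on all interfaces.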

IPv4 vs IPv6? Could be an issue. However, the OP says he can write data to DNs and they seem to communicate, so if it's IPv6 related, wouldn't it impact all traffic and not just a specific port?
I agree... shut down IPv6 if you can.
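A minimal sketch of how to do that on Linux (the sysctl names assume a reasonably recent kernel; adjust for your distro), plus the JVM flag commonly recommended for the same problem:

    # disable IPv6 at runtime (does not survive a reboot)
    sysctl -w net.ipv6.conf.all.disable_ipv6=1
    sysctl -w net.ipv6.conf.default.disable_ipv6=1

    # or just make the Hadoop JVMs prefer IPv4, in hadoop-env.sh
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"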

I don't disagree with your assessment. I am just suggesting that before you do a really deep dive, you think about the more obvious stuff first.

There are a couple of other things... like do all of the /etc/hosts files on all of the machines match?
Is the OP using both /etc/hosts and DNS? If so, are they in sync?
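A quick way to check both, sketched with made-up hostnames (node1..node3):

    # compare /etc/hosts across the cluster; all checksums should match
    for h in node1 node2 node3; do ssh $h md5sum /etc/hosts; done

    # see what the resolver actually returns for a node,
    # then compare it with what DNS says directly
    getent hosts node1
    nslookup node1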

BTW, you said DNS in your response. If you're using DNS, then you don't really want to have much info in the /etc/hosts file except loopback and the server's own IP address.
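For example, a minimal /etc/hosts on a DNS-managed node might look like this (the hostname and address are made up):

    127.0.0.1      localhost
    192.168.1.10   node1.example.com   node1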

Looking at the problem, the OP is indicating that some traffic works while other traffic doesn't. Most likely something is blocking the ports; iptables is the first place to look.
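To see whether any rules are active (as root, on each node):

    # list all filter rules with counters; empty chains with an
    # ACCEPT policy mean iptables is not the culprit
    iptables -L -n -v

    # on Red Hat-style distros the firewall may run as a service
    service iptables status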

Just saying. ;-)
On Aug 13, 2012, at 9:12 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hi Michael,
>        I asked for the hosts file because it looks like a loopback problem to me. The log shows that the call is going to 0.0.0.0. Apart from what you have said, I think disabling IPv6 and making sure that there is no problem with DNS resolution is also necessary. Please correct me if I am wrong. Thank you.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Mon, Aug 13, 2012 at 7:06 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Based on your /etc/hosts output, why aren't you using DNS?
>
> Outside of MapR, multihomed machines can be problematic. Hadoop doesn't generally work well when you're not using the FQDN or its alias.
>
> The issue isn't SSH. If you go to the node that is having trouble connecting to another node and try to ping it, or some other general communication, and that succeeds, then the port you're trying to communicate over is blocked, and it's more than likely an IP configuration or firewall issue.
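
A quick way to test a specific port is nc, sketched here with the classic Hadoop 1.x defaults (50010 for the DataNode data-transfer port, 50060 for the TaskTracker web UI) and placeholder hostnames; substitute your own:

    # does anything answer on the DataNode data port?
    nc -zv datanode1 50010

    # and on the TaskTracker HTTP port?
    nc -zv tasktracker1 50060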
>
> On Aug 13, 2012, at 8:17 AM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:
>
>> Hi Michael,
>>
>> Well, I can ssh from any node to any other without being prompted. The reason for this is that my home dir is mounted on every server in the cluster.
>>
>> Whether the machines are multihomed: I don't know. I could ask, if this is of importance.
>>
>> Shall I?
>>
>> Regards,
>> Elmar
>>
>> On 13.08.12 14:59, Michael Segel wrote:
>>> If the nodes can communicate and distribute data, then the odds are that the issue isn't going to be in his /etc/hosts.
>>>
>>> A more relevant question is whether he's running a firewall on each of these machines.
>>>
>>> A simple test... ssh to one node, ping other nodes and the control nodes at random to see if they can see one another. Then check to see if there is a firewall running which would limit the types of traffic between nodes.
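
Something like this small loop (hostnames are placeholders) automates that spot check from any one node:

    # ping every other node once, with a 2-second timeout
    for h in master node1 node2 node3; do
        ping -c 1 -W 2 $h >/dev/null && echo "$h ok" || echo "$h UNREACHABLE"
    done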
>>>
>>> One other side note... are these machines multi-homed?
>>>
>>> On Aug 13, 2012, at 7:51 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello there,
>>>>
>>>>      Could you please share your /etc/hosts file, if you don't mind.
>>>>
>>>> Regards,
>>>>     Mohammad Tariq
>>>>
>>>>
>>>>
>>>> On Mon, Aug 13, 2012 at 6:01 PM, Björn-Elmar Macek <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> I am currently trying to run my Hadoop program on a cluster. Sadly, my datanodes and tasktrackers seem to have difficulties communicating, as their logs say:
>>>> * Some datanodes and tasktrackers seem to have port problems of some kind, as can be seen in the logs below. I wondered if this might be correlated with the localhost entry in /etc/hosts, as you can read in a lot of posts with similar errors, but I checked the file: neither localhost nor 127.0.0.1/127.0.1.1 is bound there. (Although you can ping localhost... the technician of the cluster said he'd look into what resolves localhost.)
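
Two commands that show what actually resolves localhost on such a box (a sketch; the nsswitch path assumes a glibc-based Linux):

    # what the resolver returns for localhost
    getent hosts localhost

    # which sources the resolver consults, and in what order
    grep '^hosts:' /etc/nsswitch.conf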