-
Datanodes going down frequently
john smith 2011-09-15, 22:06
Hi all,
I am running a 10-node cluster (1 NN + 9 DN, Ubuntu Server 10.04, 2 GB RAM each). I am facing a strange problem: my datanodes go down randomly and nothing shows up in the logs. They suddenly lose network connectivity and the NN declares them dead. Has anyone faced this problem? Is it because of Hadoop, or is it a problem with my infrastructure?
The worst part is that I need to go to the remote machine manually and restart networking. Can someone help me with this?
Btw, my Hadoop version is 0.20.2.
Thanks,
jS
-
RE: Datanodes going down frequently
Aaron Baff 2011-09-15, 23:27
Do they eventually recover and get added back into the cluster? Do you have a ton of blocks? I've noticed that sometimes the block checker will take so long, and will tie up so much CPU (and memory, which hits a GC cycle) that it stops reporting to the NN for a while, but when it finishes its check, it resumes talking to the NN and the NN adds it back in.
--Aaron
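To put a number on how long a DataNode can stay silent before the NN declares it dead, here is a minimal sketch. It assumes the rule I believe the 0.20 line uses (twice the recheck interval plus ten heartbeat intervals) and the defaults I recall for `heartbeat.recheck.interval` (5 minutes) and `dfs.heartbeat.interval` (3 seconds); please verify both against your own version before relying on it. The function name is made up for illustration.

```python
def dead_node_timeout_seconds(recheck_interval_ms=5 * 60 * 1000,
                              heartbeat_interval_s=3):
    """Seconds of missed heartbeats before the NameNode marks a DataNode dead.

    Assumption (not confirmed in this thread): the NameNode uses
    2 * recheck interval + 10 * heartbeat interval.
    """
    recheck_interval_s = recheck_interval_ms / 1000.0
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s
```

With those assumed defaults the result is 630 seconds, so a DataNode that pauses only briefly (for example during a long GC) should not be marked dead, while a machine that has dropped off the network for good will be.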
-
Re: Datanodes going down frequently
Raj Vishwanathan 2011-09-16, 01:46
You have only 2 GB of RAM? Have you checked whether you are swapping?
Raj
-
Re: Datanodes going down frequently
Harsh J 2011-09-16, 04:33
I bet it's swapping. You may just be oversubscribing those machines with your MR slots and heap per slot, or otherwise. It could also be low heap given the number of blocks it has to report (which would possibly equate to a small-files issue given your cluster size, but that's a different discussion).
--
Harsh J
-
Re: Datanodes going down frequently
john smith 2011-09-16, 04:45
Hi All,
Thanks for your inputs.
@Aaron: No, they aren't recovering. They lose network connectivity and don't get it back. I am unable to ssh to them, and I need to go and restart networking manually.
@Harsh and Raj: One thing I noticed in my hadoop-env.sh is "export HADOOP_HEAPSIZE=2000". Isn't this strange, allocating my whole RAM to the JVM? Should I reconsider this setting? Right now I am not running any MR jobs as such.
I've started my cluster and put around 30 to 40 GB of data into it with a replication factor of 3. This takes the machines down. It looks like a swapping issue, but how can I see whether I am swapping or not? Any help?
Thanks,
jS
-
Re: Datanodes going down frequently
Harsh J 2011-09-16, 05:35
John,
On Fri, Sep 16, 2011 at 10:15 AM, john smith <[EMAIL PROTECTED]> wrote:
> @Aaron : No, they aren't recovering. They are losing network connectivity
> and they are not getting it back. I am unable to ssh to them and I need to
> manually go and restart the networking.

Ah, so the machines themselves fall off the grid? Do you have to 'reset' them, hardware-wise? What state are they in: still powered on but just unresponsive over the network? Also, do only certain DNs die out this way?

> @harsh and Raj,
> One thing I noticed in my hadoop-env.sh that "export HADOOP_HEAPSIZE=2000"
> . Isn't this strange? Allocating my whole ram to the JVM ? Should I consider
> this? Right now I am not running any MR jobs as such .

You'll need more RAM if you plan to make this into a work-heavy cluster someday. 2 GB can soon become too low, given that your OS also needs quite a bit of RAM for its own operations.
Assuming each slave node runs only the DataNode process, HADOOP_HEAPSIZE=1000 should be OK. Otherwise, scale it down to 500-700 or so (although that gets too low once the block count grows). Since your OS needs a good amount of RAM too, and your DN keeps growing in blocks, you'll eventually run short of memory at 2 GB, so it depends entirely on what you're going to be doing, how much data you'll be storing, and how.
You can monitor swapping using several tools. I find 'vmstat' to be a good one for telling me whether swapping has occurred. You can also set up tools like Nagios and Ganglia across the cluster for these kinds of tasks.
--
Harsh J
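If you would rather script the swap check than eyeball vmstat, something along these lines might work. It is only a sketch: it assumes a Linux kernel that exposes the cumulative `pswpin` and `pswpout` page counters in /proc/vmstat, and the function names are made up for illustration.

```python
import time


def parse_swap_counters(vmstat_text):
    """Extract cumulative (pswpin, pswpout) page counts from /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Each line of /proc/vmstat is "<name> <value>".
        if len(fields) == 2 and fields[0] in counters:
            counters[fields[0]] = int(fields[1])
    return counters["pswpin"], counters["pswpout"]


def swap_delta(before, after):
    """Pages swapped in and out between two (pswpin, pswpout) samples."""
    return after[0] - before[0], after[1] - before[1]


def sample_swap_activity(interval_seconds=5, path="/proc/vmstat"):
    """Read the counters twice and return the pages swapped in between."""
    with open(path) as f:
        before = parse_swap_counters(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_swap_counters(f.read())
    return swap_delta(before, after)
```

Calling `sample_swap_activity()` on a datanode while data is being loaded returns a pair (pages in, pages out); values that stay above zero across repeated samples suggest the box is actively swapping.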
-
RE: Datanodes going down frequently
Aaron Baff 2011-09-16, 16:46
John,
Are the machines simply unreachable? Or has the OS crashed? We've been having quite a few problems with our network mbufs filling up and not getting released, which eventually makes a machine unreachable via the network, although it is otherwise up and running fine. Can you attach a KVM to a machine when it becomes unreachable and take a look? Or add some monitoring to keep an eye on the network mbufs? I don't know whether this is your problem as well.
--Aaron
-
Re: Datanodes going down frequently
john smith 2011-09-16, 17:03
Hi Aaron,
I haven't really run any MR jobs on my cluster so far. I've just been pushing data into HDFS, so the network shouldn't be a problem. Initially my HADOOP_HEAPSIZE was set to 2000 MB and my RAM size was 2 GB. This resulted in datanodes going down randomly. I eventually realized that the OS kept crashing and the system stayed unresponsive until I manually powered it on again. So I reduced HADOOP_HEAPSIZE to 800 MB, and the cluster seems to be stable again; the datanodes have been stable for the past few hours. (I am not sure, though; I need to run a few heavy tasks to check it thoroughly.) It looks like my problem wasn't the Ethernet interface going down; it was actually a full OS crash.
I am not used to KVM, so I'll have to google it, and then I'll attach it to the datanodes and watch them closely in case they fail again in the future.
What about your cluster? Are you running any "shuffle-intensive" jobs like JOINs or CROSS PRODUCTs?
Thanks
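The arithmetic behind the 2000 MB versus 800 MB observation can be sketched roughly as follows. This is only a back-of-the-envelope check: the 512 MB OS reserve is an assumed figure, 2 GB is taken as 2048 MB, the function names are hypothetical, and a JVM needs some memory beyond its maximum heap, so real headroom is smaller than what this reports.

```python
def memory_headroom_mb(physical_ram_mb, heap_size_mb,
                       daemons_per_node=1, os_reserve_mb=512):
    """RAM left over after every daemon's maximum heap and an OS reserve.

    HADOOP_HEAPSIZE is applied to each Hadoop daemon started on the node,
    so the heap is counted once per daemon. os_reserve_mb is an assumption.
    """
    committed_mb = heap_size_mb * daemons_per_node + os_reserve_mb
    return physical_ram_mb - committed_mb


def is_oversubscribed(physical_ram_mb, heap_size_mb,
                      daemons_per_node=1, os_reserve_mb=512):
    """True when the heaps plus the OS reserve exceed physical RAM."""
    return memory_headroom_mb(physical_ram_mb, heap_size_mb,
                              daemons_per_node, os_reserve_mb) < 0
```

Under these assumptions, `is_oversubscribed(2048, 2000)` is true while `is_oversubscribed(2048, 800)` is false, which is consistent with the behaviour described above. A node that also runs a TaskTracker would count as two daemons, before adding any per-task child heaps.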
-
RE: Datanodes going down frequently
Aaron Baff 2011-09-16, 17:18
By KVM I was referring to a Keyboard-Video-Mouse console: basically a cart with a monitor, mouse and keyboard that you plug into a server for console access.
Ah, yes, it does sound like your OS was having problems with memory then. We're not generally having problems with MR jobs per se, but it _appears_ that there is something going on when doing HDFS accesses. Most of our jobs use custom grouping and sorting comparators, but they aren't joins, so they are probably not too intensive. The newer cluster we are going to be using from now on is CDH3u1, and the people on that mailing list don't really have a clue why we're seeing this behavior. We're running on FreeBSD with the Diablo JVM (Java 1.6), which a guy on their list feels is a pretty unusual configuration that people aren't really running.
--Aaron