-
HBase 0.90.0 region servers dying
Enis Soztutar 2011-02-16, 08:40
Hi,
We have newly set up a cluster of 5 nodes, each with 16 GB of RAM. We use HBase 0.90.0 on top of Hadoop from CDH3. When testing HBase under heavy load generated by YCSB, we consistently see region servers dying silently, without any logs or exceptions (not even in the system logs). We couldn't track down the problem, so we tested the same setup on a Rackspace cluster with 7 nodes but similar hardware, and we didn't have any problems there.
We suspect a problem with the RAM or the motherboards, but all memory tests run successfully. I was wondering whether anyone has had similar problems before, and whether there is anything you would suggest to nail down the issue.
Thanks, Enis
-
Re: HBase 0.90.0 region servers dying
Ryan Rawson 2011-02-16, 08:46
Are your disks filling? Are you running into swap? vmstat can help diagnose this.
What is 'heavy load'? I pushed a 3-node HBase cluster to 18-24k ops/sec and I didn't feel like it hit hard enough.
Also, what did the tail of the log look like? Any "FATAL" or "ERROR" strings? Or grep for Exception?
-ryan
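The log check suggested above can be sketched in Python. This is a minimal illustration, not an HBase tool: the function name `suspicious_lines`, the `tail` default, and the sample lines are assumptions made up for the example; only the three marker strings come from the message.

```python
import re
from typing import Iterable, List, Tuple

# Marker strings named in the message, matched anywhere in a line.
PATTERN = re.compile(r"FATAL|ERROR|Exception")


def suspicious_lines(lines: Iterable[str], tail: int = 2000) -> List[Tuple[int, str]]:
    """Return (line_number, text) for marker hits within the last `tail` lines."""
    buffered = list(lines)
    start = max(0, len(buffered) - tail)
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(buffered[start:], start=start + 1)
        if PATTERN.search(text)
    ]


# Invented sample lines, only to show the call; real input would be an open log file.
sample = [
    "INFO service started (made-up sample line)",
    "ERROR something failed (made-up sample line)",
    "java.io.IOException: made-up sample line",
]
for number, text in suspicious_lines(sample):
    print(f"{number}: {text}")
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines.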
-
Re: HBase 0.90.0 region servers dying
Ted Dunning 2011-02-16, 09:00
Are the nodes themselves dying or the region server processes?
What JVM version?
-
Re: HBase 0.90.0 region servers dying
Eric 2011-02-16, 13:13
Did you increase the max open files on your system (in /etc/security/limits.conf)?
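One way to confirm the limit actually in effect is sketched below in Python, using the standard `resource` module (Unix only). The helper names and the 32768 threshold in the usage line are illustrative assumptions. The call reports the limit of the process running the script, so it would need to be run as the same user, from the same kind of session, that starts the region server.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets(limit: int, required: int) -> bool:
    """True if `limit` is unlimited or at least `required`."""
    return limit == resource.RLIM_INFINITY or limit >= required


def describe(limit: int) -> str:
    """Render a limit value, showing the unlimited sentinel as text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print(f"soft={describe(soft)} hard={describe(hard)}")
# 32768 is only an example threshold, not a recommendation from this message.
print("soft limit sufficient:", meets(soft, 32768))
```

The soft limit is the one a process is actually held to; the hard limit is only the ceiling up to which the soft limit may be raised.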
-
Re: HBase 0.90.0 region servers dying
Enis Soztutar 2011-02-18, 06:14
Hi,
Thanks everyone for the answers. I had already increased the file descriptor limit to 32768. The region servers and the ZooKeeper processes are dying, but the datanodes and tasktrackers keep running (they are configured with a max heap of 1 GB). The logs do not contain any indication that something is going wrong; the last entries are typical INFO-level messages. I have also checked the kernel logs, but the kernel does not report that it is killing the processes either. While testing, two of the servers restarted at different times, which was the original reason I had suspected a memory error. After we replaced the power supplies, the nodes stopped restarting, but the processes kept dying.
For the load, the YCSB test for 10M records goes on for a while at 4K inserts per second, but cannot complete because the region servers die one by one. iostat also shows light CPU and I/O utilization, around 20%. Any more suggestions for debugging would be more than welcome.
Thanks, Enis
-
Re: HBase 0.90.0 region servers dying
Jean-Daniel Cryans 2011-02-18, 19:50
Just to make sure, you did check the .out file after a failure, right?
J-D
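A hedged sketch of that check in Python. Besides the daemon's .out file, a HotSpot JVM that crashes normally leaves an hs_err_pid<pid>.log file in its working directory (unless redirected with -XX:ErrorFile), so the sketch looks for both. The function name and the directories passed in are assumptions; where the .out files end up depends on the installation.

```python
from pathlib import Path
from typing import Iterable, List


def find_crash_artifacts(directories: Iterable[str]) -> List[Path]:
    """Return hs_err_pid*.log files and non-empty *.out files, newest first."""
    found: List[Path] = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip paths that do not exist on this node
        found.extend(root.glob("hs_err_pid*.log"))
        found.extend(p for p in root.glob("*.out") if p.stat().st_size > 0)
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)


# The current directory is only a placeholder; pass the daemon's log and
# working directories instead.
for path in find_crash_artifacts(["."]):
    print(path)
```

Empty .out files are skipped because a daemon that wrote nothing to stdout or stderr leaves nothing there to read.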
-
Re: HBase 0.90.0 region servers dying
Enis Soztutar 2011-02-19, 08:58
Yes, indeed, but no luck.
Enis
-
Re: HBase 0.90.0 region servers dying
Jean-Daniel Cryans 2011-02-22, 20:34
Ted asked about the JVM version, but I don't think you answered that. In any case, try with u17.
J-D
-
Re: HBase 0.90.0 region servers dying
Stack 2011-02-22, 21:18
Regionservers AND ZooKeeper nodes dying while it ran fine on another cluster is a little mysterious, especially when there is nothing in the logs (system, HBase, or ZooKeeper). It sounds like a hardware issue, but you'd usually see some sort of complaint logged. The processes just go away?
St.Ack