Thanks everyone for the answers.

I had already increased the file descriptor limit to 32768. The region
servers and the ZooKeeper processes are dying, but the datanodes and
tasktrackers keep running (they are configured with a max heap of 1 GB).
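For reference, since limits.conf changes only apply to sessions started after
the edit, here is a minimal sketch (assuming a Linux host) of checking the
limit a running region server actually inherited; pass the PID as the
argument:

    import sys

    # Print the open-files limit actually applied to a running process.
    pid = sys.argv[1]
    with open("/proc/%s/limits" % pid) as f:
        for line in f:
            if line.lower().startswith("max open files"):
                print(line.rstrip())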
The logs do not contain any indication that something is going wrong; the
last entries are routine INFO-level messages. I have also checked the kernel
logs, but the kernel does not report killing the processes either.
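The kernel-side check amounts to scanning for OOM-killer traces; a minimal
sketch, assuming a Linux host with dmesg on the PATH:

    import subprocess

    # Scan the kernel ring buffer for traces of the OOM killer, which
    # terminates processes without leaving anything in application logs.
    output = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
    for line in output.splitlines():
        lowered = line.lower()
        if "oom" in lowered or "out of memory" in lowered or "killed process" in lowered:
            print(line)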
While testing, two of the servers restarted at different times, which was
the original reason I suspected a memory error. But after we replaced the
power supplies the nodes no longer restarted, although the processes kept
dying.
As for the load, the YCSB test for 10M records runs for a while at about 4K
inserts per second, but cannot complete because the region servers die one
by one. iostat also shows light CPU and I/O utilization, around 20%. Any
more suggestions for debugging would be more than welcome.
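One thing that might help pinpoint the moment of death is a small watchdog
that polls a region server PID and snapshots the kernel log as soon as the
process disappears. A minimal sketch (hypothetical helper, assuming a Linux
host; pass the PID as the argument):

    import os
    import subprocess
    import sys
    import time

    # Poll a PID and, the moment the process disappears, capture the
    # tail of the kernel log so the death can be correlated with any
    # OOM-killer or hardware messages.
    def watch(pid):
        while True:
            try:
                os.kill(pid, 0)  # signal 0: existence check only
            except OSError:
                print("process %d gone at %s" % (pid, time.ctime()))
                log = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
                print(log[-4000:])
                return
            time.sleep(5)

    if __name__ == "__main__":
        watch(int(sys.argv[1]))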
On Wed, Feb 16, 2011 at 5:13 AM, Eric <[EMAIL PROTECTED]> wrote:
> Did you increase the max open files limit on your system (in
> /etc/security/limits.conf)?
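> For example, entries along these lines (hypothetical usernames; use
> whichever accounts run the Hadoop and HBase daemons):
>
>     hadoop  -  nofile  32768
>     hbase   -  nofile  32768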
> 2011/2/16 Enis Soztutar <[EMAIL PROTECTED]>
> > Hi,
> > We have newly set up a cluster of 5 nodes, each with 16 GB of RAM. We use
> > HBase 0.90.0 on top of Hadoop from CDH3. When testing HBase under heavy
> > load
> > generated by YCSB, we consistently see region servers dying silently,
> > without any logs or exceptions (not even in the system logs). We couldn't
> > track down the problem, so we tested the same setup on a Rackspace cluster
> > with 7 nodes but similar hardware, and we didn't have any problem.
> > We suspect a problem with the RAM or the motherboards, but all hardware
> > tests ran successfully. I was wondering if anyone has had similar problems
> > before, and whether there is anything you would suggest to nail down the issue.
> > Thanks,
> > Enis