HBase >> mail # dev >> scanner lease expired/region server shutdown

RE: scanner lease expired/region server shutdown

Looking further up in the logs (about 20 minutes before the errors first started), I noticed the following.

btw, ulimit -a shows that I have "open files" set to 64k. Is that not sufficient?

2010-01-25 11:10:21,774 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(, storageID=DS-1418567969-, infoPort=50075, ipcPort=50020):DataXceiveServer: java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:84)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:130)
        at java.lang.Thread.run(Thread.java:619)

2010-01-25 11:10:21,566 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(, storageID=DS-1418567969-, infoPort=50075, ipcPort=50020):Got exception while serving blk_3332344970774019423_10249 to /
java.io.FileNotFoundException: /mnt/d1/HDFS-kannan1/current/subdir23/blk_3332344970774019423_10249.meta (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getMetaDataInputStream(FSDataset.java:682)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:97)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
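The "Too many open files" above is hit against the limit of the DataNode process itself, which is not necessarily what `ulimit -a` reports in an interactive shell. A minimal Linux-only sketch for checking the limit the running daemon actually inherited (the `jps`-based pid lookup in the comment is just an example):

```shell
# Report the fd limit a process inherited and how many fds it has open.
# Linux-only sketch; /proc layout assumed. Pass the DataNode's pid, e.g.
# from `jps | awk '/DataNode/ {print $1}'`.
check_fd_limits() {
    pid=$1
    grep 'open files' "/proc/$pid/limits"        # limit the process actually has
    echo "in use: $(ls "/proc/$pid/fd" | wc -l)" # fds currently open
}

# Demo against the current shell; on the cluster you would use the DataNode pid.
check_fd_limits $$
```

If the "Max open files" line shows 1024 rather than 64k, the limit was raised for the login shell but not for the user the daemon runs as -- a common gotcha with limits.conf and init scripts.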

From: Kannan Muthukkaruppan [[EMAIL PROTECTED]]
Sent: Tuesday, January 26, 2010 7:01 AM
Subject: RE: scanner lease expired/region server shutdown

1. Yes, it is a 5-node setup.

One NameNode and four DataNodes. Of the four DataNodes, one runs the HBase Master and the other three run region servers. ZooKeeper is on the same five nodes; we should ideally have separated it out. The nodes are 16GB, 4-disk machines.

2. I examined the HDFS DataNode log on the same machine around the time the problems happened, and saw this:

2010-01-25 11:33:09,531 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(, storageID=DS-1418567969-, infoPort=50075, ipcPort=50020):Got exception while serving blk_5691809099673541164_10475 to /
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/ remote=/]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
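For what it's worth, the 480000 millis in that timeout is the DataNode's default socket write timeout, i.e. 8 minutes -- so the reader on the other end stalled or went away for a long time, rather than a tight limit being tripped. Quick arithmetic check:

```shell
# 480000 ms is the DataNode's default socket write timeout, in minutes:
echo $((480000 / 1000 / 60))   # prints 8
```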

[For future runs, will try to turn on Java GC logging as well as per your suggestion.]
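For reference, the hbase-env.sh change being suggested is roughly the following. This is a sketch -- the log path here is an assumption, and the exact flags vary by version, so check the commented-out lines in your own hbase-env.sh:

```shell
# Sketch of turning on GC logging in hbase-env.sh; the stock file ships a
# similar commented-out line, and the log path below is only an example.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/tmp/gc-hbase.log"
```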

3.  You wrote <<< There is no line after the above?  A region with a startkey of 0031841132?  >>>

I just cut-and-pasted a section of the PE log and the scan of '.META.' from the shell. But there were many such regions for which the PE client reported errors of the form:

10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 1 of 10 failed; retrying after sleep of 1000 because: Connection refused
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of TestTable,0015411453,99999999999999
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is

and a scan of '.META.' revealed that those regions, such as 'TestTable,0015411453,99999999999999', had splitA & splitB portions in .META.
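To inspect that state, the catalog scan can be scripted against the shell. A sketch, using the region name from the log above as an example start row -- in practice you would pipe the command into `hbase shell`:

```shell
# Build a .META. scan starting at the parent region's row; the daughter
# split info shows up as info:splitA / info:splitB columns on that row.
META_SCAN="scan '.META.', {STARTROW => 'TestTable,0015411453,', LIMIT => 3}"
echo "$META_SCAN"   # run as: echo "$META_SCAN" | hbase shell
```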

<<< Then it's a dropped edit of .META.  The parent region -- the one being split -- got the updates that added splitA and splitB and its offlining, but the new region didn't get inserted?  The crash happened just at that time? >>>

Didn't fully understand the above.

<<< It would be interesting to take a look at the regionserver logs from that time.  Please post if you have a moment so we can take a looksee.>>>

Will do. Should I just send it as an attachment to the list? Or is there a recommended way of doing this?

Sent: Monday, January 25, 2010 8:59 PM
Subject: Re: scanner lease expired/region server shutdown

What J-D said but with ornamentation.  See below.

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan

What's the cluster you're hitting like?  That 5-node thingy?  What's the
hardware profile?

This is saying that it took 65 seconds to append to hdfs.  What was
going on at that time?  A fat GC in the regionserver or over in a
Datanode?  You can enable GC logging by uncommenting stuff in
hbase-env.sh.  Feed the GC log to https://gchisto.dev.java.net/
(suggested by the zookeeper lads).  It's good for finding the long
pauses.  We should find the logs around the long GC pause.  It's
probably a failed