Kannan Muthukkaruppan 2010-01-26, 15:01
RE: scanner lease expired/region server shutdown

Looking further back in the logs (about 20 minutes before the errors first started), I noticed the following.

btw, ulimit -a shows that I have "open files" set to 64k. Is that not sufficient?

2010-01-25 11:10:21,774 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.129.68.212:50010, storageID=DS-1418567969-10.129.68.212-50010-1263610251776, infoPort=50075, ipcPort=50020):Data\
XceiveServer: java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:84)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:130)
        at java.lang.Thread.run(Thread.java:619)

2010-01-25 11:10:21,566 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.129.68.212:50010, storageID=DS-1418567969-10.129.68.212-50010-1263610251776, infoPort=50075, ipcPort=50020):Got \
exception while serving blk_3332344970774019423_10249 to /10.129.68.212:
java.io.FileNotFoundException: /mnt/d1/HDFS-kannan1/current/subdir23/blk_3332344970774019423_10249.meta (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getMetaDataInputStream(FSDataset.java:682)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:97)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
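
[On the "open files" question above: a 64k nofile limit would normally be plenty, so the things worth double-checking are whether that limit actually applies to the user the DataNode/regionserver daemons run as, and whether the DataNode's xceiver cap is high enough. A rough sketch of the usual knobs -- the 'hadoop' username and the values below are illustrative, not taken from this cluster:]

    # check the limit as seen by the user actually running the daemons
    su - hadoop -c 'ulimit -n'

    # /etc/security/limits.conf: raise the per-process file-descriptor cap
    hadoop  -  nofile  65536

    <!-- hdfs-site.xml: cap on concurrent DataNode xceiver threads
         (note the historical spelling "xcievers") -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>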

________________________________________
From: Kannan Muthukkaruppan [[EMAIL PROTECTED]]
Sent: Tuesday, January 26, 2010 7:01 AM
To: [EMAIL PROTECTED]
Subject: RE: scanner lease expired/region server shutdown

1. Yes, it is a 5-node setup.

1 NameNode / 4 DataNodes. Of the 4 DNs, one is running the HBase Master, and the other three are running region servers. ZK runs on all 5 of the same nodes; we should ideally have separated that out. The nodes are 16GB, 4-disk machines.

2. I examined the HDFS DataNode log on the same machine around the time the problems happened, and saw this:

2010-01-25 11:33:09,531 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.129.68.212:50010, storageID=DS-1418567969-10.129.68.212-50010-1263610251776, infoPort=50075, ipcPort=\
50020):Got exception while serving blk_5691809099673541164_10475 to /10.129.68.212:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.129.68.212:50010 remote=/10.129.68.212:477\
29]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
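
[Side note: the 480000 ms in the timeout above is the DataNode's default write timeout (dfs.datanode.socket.write.timeout, 8 minutes), i.e. the reader on the other end stopped consuming for that long. If one wanted to experiment with the timeout itself, it lives in hdfs-site.xml -- the value below is only an illustration, and a stalled reader or GC pause is the more likely culprit:]

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>600000</value>
    </property>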

[For future runs, I will try to turn on Java GC logging as well, per your suggestion.]
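
[A sketch of what that might look like in hbase-env.sh -- the exact flags and log path here are illustrative and may differ slightly in this release:]

    # hbase-env.sh: enable verbose GC logging for the HBase daemons
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"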

3.  You wrote <<< There is no line after the above?  A region with a startkey of 0031841132?  >>>

I just cut-and-pasted a section of the PE log and the scan of '.META.' from the shell. But there were many such regions for which the PE client reported errors of the form:

10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 1 of 10 failed; retrying after sleep of 1000 because: Connection refused
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of TestTable,0015411453,99999999999999
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.129.68.212:60020

and a scan of '.META.' revealed that those regions, such as 'TestTable,0015411453,99999999999999', had splitA & splitB portions in .META.
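
[For concreteness, a shell scan along these lines shows the splitA/splitB columns for a given parent region -- the row key and column list here are just illustrative:]

    hbase> scan '.META.', {STARTROW => 'TestTable,0015411453,', LIMIT => 1,
             COLUMNS => ['info:regioninfo', 'info:splitA', 'info:splitB']}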

<<< Then it's a dropped edit of .META.  The parent region -- the one being split -- got updates which added the splitA and splitB and its offlining, but the new region didn't get inserted?  The crash happened just at that time? >>>

Didn't fully understand the above.

<<< It would be interesting to take a look at the regionserver logs from that time.  Please post if you have a moment so we can take a looksee.>>>

Will do. Should I just send it as an attachment to the list? Or is there a recommended way of doing this?

regards,
Kannan
________________________________________
From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of Stack [[EMAIL PROTECTED]]
Sent: Monday, January 25, 2010 8:59 PM
To: [EMAIL PROTECTED]
Subject: Re: scanner lease expired/region server shutdown

What J-D said but with ornamentation.  See below.

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote:

What's the cluster you're hitting like?  That 5-node thingy?  What's the
hardware profile?

This is saying that it took 65 seconds to append to HDFS.  What was
going on at that time?  A fat GC in the regionserver or over in a
DataNode?  You can enable GC logging by uncommenting stuff in
hbase-env.sh.  Feed the GC log to https://gchisto.dev.java.net/
(suggested by the zookeeper lads).  It's good for finding the long
pauses.  We should find the logs around the long GC pause.  It's
probably a failed