HBase, mail # dev - scanner lease expired/region server shutdown


RE: scanner lease expired/region server shutdown
Kannan Muthukkaruppan 2010-01-26, 15:01

1. Yes, it is a 5-node setup.

1 Name Node / 4 Data Nodes. Of the 4 DNs, one is running the HBase Master and the other three are running region servers. ZK is on the same 5 nodes; should ideally have separated this out. The nodes are 16GB, 4-disk machines.

2. I examined the HDFS datanode log on the same machine around that time the problems happened, and saw this:

2010-01-25 11:33:09,531 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.129.68.212:50010, storageID=DS-1418567969-10.129.68.212-50010-1263610251776, infoPort=50075, ipcPort=50020):Got exception while serving blk_5691809099673541164_10475 to /10.129.68.212:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.129.68.212:50010 remote=/10.129.68.212:47729]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
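(As an aside, the 480000 ms in that exception matches the HDFS default for dfs.datanode.socket.write.timeout, i.e. 8 minutes. If one wanted to rule the timeout value itself in or out, it can be overridden in hdfs-site.xml -- a diagnostic sketch only, since a socket stalled that long points at GC or an overloaded node, not at the timeout setting:)

```xml
<!-- Sketch for hdfs-site.xml: raise the datanode socket write timeout
     from the 8-minute default. Diagnostic knob only, not a fix. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>
</property>
```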

[For future runs, I will also turn on Java GC logging, per your suggestion.]
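For reference, enabling that logging is a matter of adding standard HotSpot flags to conf/hbase-env.sh -- a sketch, with the log path as a placeholder:

```shell
# Sketch for conf/hbase-env.sh: standard 1.6-era HotSpot GC-logging flags.
# The log path is a placeholder; point it somewhere with disk space.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc.log"
```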

3.  You wrote <<< There is no line after the above?  A region with a startkey of 0031841132?  >>>

I just cut-and-pasted a section of the PE log and of the scan of the '.META.' from the shell. But there were many such regions for which the PE client reported errors of the form:

10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 1 of 10 failed; retrying after sleep of 1000 because: Connection refused
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of TestTable,0015411453,99999999999999
10/01/25 12:31:37 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.129.68.212:60020

and scan of '.META.' revealed that those regions, such as the 'TestTable,0015411453,99999999999999', had splitA & splitB portions in META.
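For completeness, the shell scan used to look at those rows was of this shape (a sketch; the row key is the one from the log above, and it assumes `hbase` is on the PATH with the cluster up):

```shell
# Sketch: dump the problem region's row (and a couple of neighbors) from .META.
# Assumes `hbase` is on the PATH and the cluster is running.
CMD="scan '.META.', {STARTROW => 'TestTable,0015411453,99999999999999', LIMIT => 3}"
echo "$CMD"                   # show the command
# echo "$CMD" | hbase shell   # uncomment to run it against the cluster
```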

<<< Then it's a dropped edit of .META.  The parent region -- the one being split -- got the updates which added splitA and splitB and marked it offline, but the new region didn't get inserted?  The crash happened just at that time? >>>

Didn't fully understand the above.

<<< It would be interesting to take a look at the regionserver logs from that time.  Please post if you have a moment so we can take a looksee.>>>

Will do. Should I just send it as an attachment to the list? Or is there a recommended way of doing this?

regards,
Kannan
________________________________________
From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of Stack [[EMAIL PROTECTED]]
Sent: Monday, January 25, 2010 8:59 PM
To: [EMAIL PROTECTED]
Subject: Re: scanner lease expired/region server shutdown

What J-D said but with ornamentation.  See below.

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote:
> I was doing some brute force testing - running one instance of PerformanceEvaluation (PE) for writes, and another instance for randomReads.
>

What's the cluster you're hitting like?  That 5-node thingy?  What's the
hardware profile?

> One of the region servers went down after a while. [This is on 0.20.2]. The region server log had things like:
>
> 2010-01-25 11:33:39,416 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 34 on 60020 took 65190ms appending an edit to hlog; editcount=27878

This is saying that it took 65 seconds to append to HDFS.  What was
going on at that time?  A fat GC in the regionserver or over in a
Datanode?  You can enable GC logging by uncommenting the relevant lines
in hbase-env.sh.   Feed the GC log to https://gchisto.dev.java.net/
(suggested by the zookeeper lads).  It's good for finding the long
pauses.  We should find the logs around the long GC pause.  It's
probably a failed promotion that brought on the stop-the-world GC.  Or
was your HDFS struggling?
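If gchisto is unhandy, the long pauses can also be pulled out of a -XX:+PrintGCDetails log with grep/awk -- a sketch, with a made-up two-line sample standing in for the real log (the regex assumes 1.6-era HotSpot lines ending in ", <secs> secs]"):

```shell
# Sketch: list stop-the-world pauses longer than 1 second from a GC log.
# The heredoc is a made-up sample; replace gc.log with your real log file.
cat > gc.log <<'EOF'
1.234: [GC [PSYoungGen: 1K->0K(2K)] 3K->2K(9K), 0.0042310 secs]
9.876: [Full GC [PSOldGen: 5K->4K(7K)] 6K->5K(9K), 65.1900000 secs]
EOF
grep -oE '[0-9]+\.[0-9]+ secs\]' gc.log | awk '$1 > 1.0 { print $1 " sec pause" }'
# prints: 65.1900000 sec pause
```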

I was going to say that the PE puts the worst kind of load on the
hbase cache -- nothing sticks around -- but looking at your numbers
below, the cache seems to be working pretty well.
This is the zk timeout.
These will happen after the above.  The regionserver is on its way
down.  Probably emptied the list of outstanding regions.

There is no line after the above?  A region with a startkey of
0031841132?  Then it's a dropped edit of .META.  The parent region --
the one being split -- got the updates which added splitA and splitB
and marked it offline, but the new region didn't get inserted?  The
crash happened just at that time?

It would be interesting to take a look at the regionserver logs from
that time.  Please post if you have a moment so we can take a looksee.

Above kinda thing is what the master rewrite is about: moving state
transitions up into zk so they're atomic over the cluster as regions
move through transitions, rather than, as here, a multi-row update
that might not all go through as things currently work.

St.Ack
