HBase >> mail # dev >> scanner lease expired/region server shutdown


Thread:
 Kannan Muthukkaruppan 2010-01-26, 03:14
 Jean-Daniel Cryans 2010-01-26, 03:23
 Stack 2010-01-26, 04:59
 Kannan Muthukkaruppan 2010-01-26, 15:01
 Kannan Muthukkaruppan 2010-01-26, 15:10
 Stack 2010-01-26, 16:31
 Kannan Muthukkaruppan 2010-01-26, 20:56
 Stack 2010-01-27, 00:03
 Dhruba Borthakur 2010-01-26, 18:10
 Stack 2010-01-26, 18:24
 Kannan Muthukkaruppan 2010-01-26, 20:58
 Jean-Daniel Cryans 2010-01-26, 21:53
 Kannan Muthukkaruppan 2010-01-27, 20:27
 Stack 2010-01-26, 17:20
RE: scanner lease expired/region server shutdown
1. <<< Well, on split, the parent is in .META. but offlined.  It's followed by two
entries, one for each daughter region.  I was asking whether, after the
offlined parent, there was an entry for a region with the same start
key as the parent (the daughter that is to take over from the parent).
If not, then it would seem that the update of the parent -- marking it
offlined and adding the splitA and splitB columns -- went through, but the
addition of the daughter rows did not... which is odd. >>>

That's what seems to have happened. For all of these problem regions (about 50 of the 400-odd regions my table had), a scan of .META. shows five entries: info:regioninfo (with OFFLINE => true), info:server, info:serverstartcode, info:splitA, info:splitB.

 TestTable,0076347792,1264447634546  column=info:regioninfo, timestamp=1264470099825, value=REGION => {NAME => 'TestTable,0076347792,1264447634546', STARTKEY => '0076347792', ENDKEY => '', ENCODED => 654466296, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'TestTable', FAMIL...
 TestTable,0076347792,1264447634546  column=info:server, timestamp=1264447641018, value=10.129.68.214:60020
 TestTable,0076347792,1264447634546  column=info:serverstartcode, timestamp=1264447641018, value=1264109117245
 TestTable,0076347792,1264447634546  column=info:splitA, timestamp=1264470099825, value=\000\n0076724048\000\000\000\001&g\003\357\277\275\e\"TestTable,0076347792,1264448682267\000\n0076347792\000\00
 TestTable,0076347792,1264447634546  column=info:splitB, timestamp=1264470099825, value=\000\000\000\000\000\001&g\003\357\277\275\e\"TestTable,0076724048,1264448682267\000\n0076724048\000\000\000\00

There is no other entry with a start key of 0076347792.
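The condition being described (an offlined split parent carrying splitA/splitB columns, but no region row sharing its start key) can be sketched as a simple post-scan consistency check. This is only an illustrative sketch of the check, not the actual master fixup code; the dict-based row model and the helper names are assumptions, not HBase internals.

```python
# Sketch: flag offlined split parents in a .META. scan whose daughter
# regions never got rows of their own. Row representation is an
# illustrative assumption (one dict per info:regioninfo entry).

def parse_region_name(name):
    """Split 'Table,startkey,timestamp' into its three parts."""
    table, start_key, ts = name.split(",", 2)
    return table, start_key, int(ts)

def find_broken_splits(meta_rows):
    """meta_rows: dicts with 'name', 'offline', 'split' keys.
    Return names of split parents with no other region row sharing
    their table and start key (i.e. the daughter row is missing)."""
    broken = []
    for row in meta_rows:
        if not (row.get("offline") and row.get("split")):
            continue
        table, start, _ = parse_region_name(row["name"])
        has_daughter = any(
            r is not row
            and parse_region_name(r["name"])[:2] == (table, start)
            for r in meta_rows
        )
        if not has_daughter:
            broken.append(row["name"])
    return broken

rows = [
    # The offlined/split parent seen in the scan above...
    {"name": "TestTable,0076347792,1264447634546", "offline": True, "split": True},
    # ...and an unrelated healthy region; no daughter row is present.
    {"name": "TestTable,0000000000,1264447000000", "offline": False, "split": False},
]
print(find_broken_splits(rows))  # the parent is reported as broken
```

If the daughter row with start key 0076347792 had made it into .META., the parent would not be flagged, which is exactly the distinction the scan output above fails to show.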

2. <<< (Thinking on it, it would be easy enough to add a fixup to the master
for this condition until we do the proper fix in 0.21.) >>>

Do share your thoughts on that.

3. <<< Is there a public folder someplace you can post them so I can pull them? >>>

Stack: perhaps I'll email it to you offline as a start.

regards,
Kannan
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Stack
Sent: Tuesday, January 26, 2010 9:20 AM
To: [EMAIL PROTECTED]
Subject: Re: scanner lease expired/region server shutdown

On Tue, Jan 26, 2010 at 7:01 AM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote:
> 1 Name Node/4 Data Nodes. Of the 4 DN, one is running the HBase Master, and the other three are running region servers. ZK is on all the same 5 nodes. Should ideally have separated this out. The nodes are 16GB, 4 disk machines.
>

CPUs?  4 disks is good.
> 2. I examined the HDFS datanode log on the same machine around that time the problems happened, and saw this:
>
> 2010-01-25 11:33:09,531 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.129.68.212:50010, storageID=DS-1418567969-10.129.68.212-50010-1263610251776, infoPort=50075, ipcPort=\
> 50020):Got exception while serving blk_5691809099673541164_10475 to /10.129.68.212:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.129.68.212:50010 remote=/10.129.68.212:477\
> 29]
>        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)

Does this exception show in the regionserver logs -- the client?  See
HADOOP-3831 for the 'fix' that has the DFSClient reestablishing its connection.
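The reconnect-on-timeout behaviour HADOOP-3831 describes can be sketched generically: when a read times out, abandon the stale connection and retry against another replica rather than failing the whole read. This is a minimal illustrative sketch of the pattern, not DFSClient code; `open_conn`, the replica list, and the retry policy are all assumptions.

```python
# Sketch of the reconnect-and-retry pattern: a socket timeout drops the
# stale connection and the read is retried against the next replica.
# Names and policy here are illustrative, not Hadoop internals.
import socket

def read_with_retry(open_conn, replicas, attempts=3):
    """Try up to `attempts` reads, rotating through replicas;
    a socket timeout abandons that connection and moves on."""
    last_err = None
    for attempt in range(attempts):
        replica = replicas[attempt % len(replicas)]
        conn = open_conn(replica)
        try:
            return conn.recv(65536)
        except socket.timeout as err:
            last_err = err  # stale/stuck connection: reconnect and retry
        finally:
            conn.close()
    raise last_err
```

The point of the pattern is that a single slow or wedged datanode connection (like the 480000 ms write timeout in the log above) degrades into a retry rather than an error surfaced to the scanner.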

Yes. Getting GC settings right is critical, as it is for any Java application.
Well, on split, the parent is in .META. but offlined.  It's followed by two
entries, one for each daughter region.  I was asking whether, after the
offlined parent, there was an entry for a region with the same start
key as the parent (the daughter that is to take over from the parent).
If not, then it would seem that the update of the parent -- marking it
offlined and adding the splitA and splitB columns -- went through, but the
addition of the daughter rows did not... which is odd.

(Thinking on it, it would be easy enough to add a fixup to the master
for this condition until we do the proper fix in 0.21.)
Don't attach them.  They just get dropped.  Is there a public folder
someplace you can post them so I can pull them?

Thanks Kannan.
St.Ack
 Stack 2010-01-27, 22:49