HBase, mail # user - Occasional regionserver crashes following socket errors writing to HDFS


Occasional regionserver crashes following socket errors writing to HDFS
Eran Kutner 2012-05-10, 08:17
Hi,
We're seeing occasional regionserver crashes during heavy write operations
to HBase (in the reduce phase of large M/R jobs). I have already raised the
file descriptor limit, the HDFS xceivers and the HDFS threads to the
recommended settings, and in fact well above them.

Here is an example of the HBase log (showing only errors):

2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-8928911185099340956_5189425java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)

2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)

2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
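
To put the two events in the log above on a timeline, here is a minimal Python sketch that measures the gap between the first DFSClient warning and the ZooKeeper session-expired abort. The helper name `seconds_between` is made up for illustration; the two timestamps are copied from the log lines, and the script says nothing about what happened in between.

```python
from datetime import datetime

# Timestamp layout used by the log lines above, e.g. "2012-05-10 03:34:54,291".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Return the number of seconds from one log timestamp to another."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# First DFSClient WARN and the FATAL abort, as they appear in the log.
first_warning = "2012-05-10 03:34:54,291"
abort = "2012-05-10 03:48:30,174"

gap = seconds_between(first_warning, abort)
print(f"{gap:.3f} s, about {gap / 60:.1f} minutes")
```

So the region server kept running for roughly thirteen and a half minutes after the write pipeline errors before it aborted.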
This is the datanode log from 10.1.104.9 (the same machine that was running
the region server that crashed):
2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
2012-05-10 03:35:39,221 INFO
org.apache.hadoop.hdfs.server
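
The size of the block length mismatch reported in the datanode log above can be worked out with a short Python sketch. The function name `length_mismatch` and the regular expression are illustrative only; the message text is copied from the log. The 512-byte figure is an assumption: it is the default HDFS checksum chunk size, and this cluster's setting is not stated anywhere in the message.

```python
import re

# Error text as it appears in the datanode log.
MESSAGE = (
    "Block blk_-8928911185099340956_5189425 length is 24384000 "
    "does not match block file length 24449024"
)

# Assumed checksum chunk size (HDFS default); not confirmed for this cluster.
ASSUMED_CHUNK_BYTES = 512


def length_mismatch(message):
    """Return (block file length) minus (recorded block length), in bytes."""
    match = re.search(
        r"length is (\d+) does not match block file length (\d+)", message
    )
    if match is None:
        raise ValueError("not a block length mismatch message")
    recorded, on_disk = int(match.group(1)), int(match.group(2))
    return on_disk - recorded


diff = length_mismatch(MESSAGE)
print(diff)                         # 65024
print(diff % ASSUMED_CHUNK_BYTES)   # 0
print(diff // ASSUMED_CHUNK_BYTES)  # 127
```

The block file on disk is 65024 bytes longer than the recorded length, which under the 512-byte assumption is a whole number of chunks (127).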