Small cluster, " Bad connect ack with firstBadLink as XX.XX.XX.140:51010"
I'm testing flush. I'm using branch-0.21 HDFS. I run a write load and kill
one of the members of my little 4-node HDFS cluster. One of the four
datanodes being down does more damage than it should. Any suggestions for
how I can make it less catastrophic would be appreciated.
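
For concreteness, the load is roughly the below (hypothetical path and
sizes; hflush() being what sync() became in 0.21 -- adjust if your branch
still calls it sync()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Write a stream of small edits, flushing each one, while a datanode
    // gets killed out from under the write pipeline.
    FSDataOutputStream out = fs.create(new Path("/test/hlog.dat"), (short) 3);
    byte[] edit = new byte[100];
    while (true) {
      out.write(edit);
      out.hflush();  // push the edit down the 3-node pipeline
    }
  }
}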

Here's the story. 140 is the datanode I killed. In the logging below we
are closing one log file and opening another. The open of the new log file
fails because of "Bad connect ack with firstBadLink". The failed open is
dire news for hbase:

2009-11-12 00:06:50,111 [IPC Server handler 13 on 60020] INFO
org.apache.hadoop.hbase.regionserver.wal.HLog: edit=52700,
write=TestTable/TestTable,,1257984387166/870094
2009-11-12 00:06:50,205 [regionserver/XX.XX.XX.141:60020.logRoller] INFO
org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
/hbase/.logs/XXX-XXX-14.X.X.X,60020,1257984279957/hlog.dat.1257984280320,
entries=52744, calcsize=63765500, filesize=58596834. New hlog
/hbase/.logs/XXX-XXX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199
2009-11-12 00:06:50,211 [regionserver/XX.XX.XX.141:60020.logRoller] DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLog: Found 0 hlogs to remove  out
of total 1; oldest outstanding seqnum is 4 from region -ROOT-,,0
2009-11-12 00:06:50,211 [IPC Server handler 12 on 60020] INFO
org.apache.hadoop.hbase.regionserver.wal.HLog: edit=0,
write=TestTable/TestTable,,1257984387166/870139
2009-11-12 00:06:50,213 [regionserver/XX.XX.XX.141:60020.logSyncer] INFO
org.apache.hadoop.hbase.regionserver.wal.HLog: sync
2009-11-12 00:06:50,531 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink as XX.XX.XX.140:51010
2009-11-12 00:06:50,532 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_2806163206344237164_1089
2009-11-12 00:06:56,570 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink as XX.XX.XX.140:51010
2009-11-12 00:06:56,570 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_246229128928850217_1090
2009-11-12 00:07:02,592 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink as XX.XX.XX.140:51010
2009-11-12 00:07:02,592 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-4429542440123385070_1090
2009-11-12 00:07:08,603 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink as XX.XX.XX.140:51010
2009-11-12 00:07:08,603 [Thread-62] INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-8393964942281984584_1092
2009-11-12 00:07:14,616 [Thread-62] WARN org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)

2009-11-12 00:07:14,616 [Thread-62] WARN org.apache.hadoop.hdfs.DFSClient:
Could not get block locations. Source file
"/hbase/.logs/XXX-XXX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199" -
Aborting...
2009-11-12 00:07:14,621 [regionserver/XX.XX.XX.141:60020.logSyncer] FATAL
org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting
close of hlog
Why is the client so dogged about 140? Why doesn't it move on when it gets
the bad connect ack? And why does it ultimately fail with "Unable to
create new block"?
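
For reference, here is what I understand the DataStreamer to be doing,
paraphrased and simplified from my reading of branch-0.21 DFSClient (not
verbatim):

// Simplified paraphrase of DFSClient$DFSOutputStream$DataStreamer
// .nextBlockOutputStream in branch-0.21 -- my reading, not verbatim.
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
  int count = conf.getInt("dfs.client.block.write.retries", 3);
  boolean success = false;
  DatanodeInfo[] nodes = null;
  do {
    // Ask the namenode for a fresh block. Nothing in this call tells the
    // namenode to keep the datanode that just failed out of the new
    // pipeline, so 140 can show up again.
    LocatedBlock lb = locateFollowingBlock(System.currentTimeMillis());
    block = lb.getBlock();
    nodes = lb.getLocations();

    // Connect to the first datanode, which connects on downstream; a bad
    // ack names the first node in the pipeline that could not be reached.
    success = createBlockOutputStream(nodes, clientName, false);

    if (!success) {
      LOG.info("Abandoning block " + block);
      namenode.abandonBlock(block, src, clientName);
      try {
        Thread.sleep(6000);  // the ~6 second gaps between retries above
      } catch (InterruptedException e) {
        // ignore and retry
      }
    }
  } while (!success && --count >= 0);

  if (!success) {
    throw new IOException("Unable to create new block.");
  }
  return nodes;
}

If that reading is right, dfs.client.block.write.retries (default 3) plus
the initial attempt accounts for the four "Abandoning block" lines above,
and nothing ever tells the namenode to exclude 140, which would explain
why it keeps turning up in new pipelines.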

Here is the NN log excerpt for block 1089:

2009-11-12 00:06:50,214 DEBUG org.apache.hadoop.hdfs.StateChange: DIR*
FSDirectory.addFile:
/hbase/.logs/XX-XX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199 with
blk_2806163206344237164_1089 block is added to the in-memory file system
2009-11-12 00:06:50,214 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/hbase/.logs/XX-XX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199.
blk_2806163206344237164_1089{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[XX.XX.XX.141:51010
|RBW], ReplicaUnderConstruction[XX.XX.XX.142:51010|RBW],
ReplicaUnderConstruction[XX.XX.XX.140:51010|RBW]]}
2009-11-12 00:06:50,532 DEBUG org.apache.hadoop.hdfs.StateChange: *BLOCK*
NameNode.abandonBlock: blk_2806163206344237164_1089 of file
/hbase/.logs/XX-XX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199
2009-11-12 00:06:50,532 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.abandonBlock: blk_2806163206344237164_1089of file
/hbase/.logs/XX-XX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199
2009-11-12 00:06:50,533 DEBUG org.apache.hadoop.hdfs.StateChange: DIR*
FSDirectory.addFile:
/hbase/.logs/XX-XX-14.X.X.X,60020,1257984279957/hlog.dat.1257984410199 with
blk_2806163206344237164_1089 block is added to the file system
2009-11-12 00:06:50,533 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.abandonBlock: blk_2806163206344237164_1089 is removed from
pendingCreates

The story for blocks 1090 and 1092 is the same as the above.
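
In ClientProtocol terms, each failed attempt above looks to the NN like
this (signatures approximate, from memory):

// Per-attempt RPC sequence implied by the NN log (approximate):
LocatedBlock lb = namenode.addBlock(src, clientName);  // NameSystem.allocateBlock -> blk_..._1089
// pipeline setup fails because 140 is dead, so the client calls:
namenode.abandonBlock(lb.getBlock(), src, clientName); // removed from pendingCreates
// ...then it loops back to addBlock and gets blk_..._1090, and so on.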

Thanks for any help,
St.Ack