HDFS Sink stops writing events because HDFSWriter failed to append and close a file
Hi All,

We are facing this issue in our production Flume setup.

The issue starts when the HDFS sink's BucketWriter fails to append a batch to
a file because of a Hadoop datanode problem.
It then tries to close that file, but close() also throws an
exception, and the HDFS sink does not remove that BucketWriter instance.
(The cause may be that the number of open BucketWriter instances is below
*maxOpenFiles*, so the sink never evicts it.)

What makes this worse is that the HDFS sink never lets go of that
BucketWriter instance and keeps trying to append to and close that file
indefinitely, which disrupts event processing towards HDFS.
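
For context, here is a rough sketch of the kind of HDFS sink configuration
involved. The agent/sink/channel names and values below are placeholders
reconstructed from the logs (not our exact config), and hdfs.closeTries /
hdfs.retryInterval only exist in newer Flume releases:

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = fileChannel
agent.sinks.hdfsSink.hdfs.path = /flume/data
agent.sinks.hdfsSink.hdfs.filePrefix = Flume_raw_1_
agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent.sinks.hdfsSink.hdfs.codeC = lzo
agent.sinks.hdfsSink.hdfs.maxOpenFiles = 5000
agent.sinks.hdfsSink.hdfs.callTimeout = 10000
# Only in newer Flume releases: bound the retries of a failed close/rename
agent.sinks.hdfsSink.hdfs.closeTries = 3
agent.sinks.hdfsSink.hdfs.retryInterval = 180

The knobs most relevant to this failure mode are hdfs.callTimeout (how long a
single append/flush/close call may block) and, where available,
hdfs.closeTries / hdfs.retryInterval (how many times, and how often, a failed
close is retried before the sink gives up on the file).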

Logs from the flume agent:
2013-05-20 02:32:40,669 (ResponseProcessor for block
blk_8857674139042547711_2143611) [WARN -
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3015)]
DFSOutputStream ResponseProcessor exception  for block
blk_8857674139042547711_2143611java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<datanode_ip>:41129
remote=/<datanode2_ip>:60010]
         at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
         at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
         at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
         at java.io.DataInputStream.readFully(DataInputStream.java:178)
         at java.io.DataInputStream.readLong(DataInputStream.java:399)
         at
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
         at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)

2013-05-20 02:32:56,321 (DataStreamer for file
/flume/data/Flume_raw_1_.1368960029025.lzo.tmp block
blk_8857674139042547711_2143611) [WARN -
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3051)]
*Error Recovery for block blk_8857674139042547711_2143611 bad
datanode[0] <datanode2_ip>:60010*
2013-05-20 02:32:56,322 (DataStreamer for file
/flume/data/Flume_raw_1_.1368960029025.lzo.tmp block
blk_8857674139042547711_2143611) [WARN -
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3102)]
Error Recovery for block blk_8857674139042547711_2143611 in pipeline
<datanode2_ip>:60010, <datanode3_ip>:60010, <datanode4_ip>:60010: bad
datanode <datanode2_ip>:60010
2013-05-20 02:32:56,538 (Log-BackgroundWorker-fileChannel) [INFO -
org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint(EventQueueBackingStoreFile.java:109)]
Start checkpoint for
/home/flume/flume_channel/checkpointDir15/checkpoint, elements to sync =
42000
2013-05-20 02:32:56,634 (Log-BackgroundWorker-fileChannel) [INFO -
org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint(EventQueueBackingStoreFile.java:117)]
Updating checkpoint metadata: logWriteOrderID: 1370969225649, queueSize:
140156, queueHead: 72872626
2013-05-20 02:32:56,722 (Log-BackgroundWorker-fileChannel) [INFO -
org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint(LogFileV3.java:85)]
Updating log-543.meta currentPosition = 743847065, logWriteOrderID =
1370969225649
2013-05-20 02:32:56,759 (Log-BackgroundWorker-fileChannel) [INFO -
org.apache.flume.channel.file.Log.writeCheckpoint(Log.java:886)] Updated
checkpoint for file: /home/flume/flume_channel/dataDir15/log-543
position: 743847065 logWriteOrderID: 1370969225649
2013-05-20 02:33:56,331 (DataStreamer for file
/flume/data/Flume_raw_1_.1368960029025.lzo.tmp block
blk_8857674139042547711_2143611) [WARN -
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3123)]
Failed recovery attempt #0 from primary datanode <datanode3_ip>:60010
java.net.SocketTimeoutException: Call to /<datanode3_ip>:60021 failed on
socket timeout exception: java.net.SocketTimeoutException: 60000 millis
timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<datanode_ip>:41526
remote=/<datanode3_ip>:60021]
..
..
..
2013-05-20 02:36:00,249 (DataStreamer for file
/flume/data/Flume_raw_1_.1368960029025.lzo.tmp block
blk_8857674139042547711_2143611) [WARN -
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3123)]
Failed recovery attempt #0 from primary datanode <datanode2_ip>:60010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_8857674139042547711_2143611 is already commited, storedBlock == null.
         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5388)
         at
org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:748)
         at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
         at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:396)
         at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
         at $Proxy5.nextGenerationStamp(Unknown Source)
         at
org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1999)
         at
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1967)
         at

Replies:
Jeff Lord 2013-05-28, 21:38
Ashish Tadose 2013-05-29, 06:35
Ron van der Vegt 2014-01-06, 11:54