This question is in two parts. We have Flume agents (1.3.1, recently
upgraded to 1.4) with HDFS event sinks that write to our Hadoop cluster.
We occasionally get timeouts writing to Hadoop (stack trace below), and
eventually the channel queues start backing up. Under normal load the
queues sit at 1-5% full, so there shouldn't be any reason the sinks can't
keep up; nevertheless the queues eventually fill and we have to restart
the agent. Has anyone seen anything similar? This happens often enough
that we have to restart the agent every day or two.
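For context, the sink configuration looks roughly like this (agent, channel, and sink names here are illustrative placeholders, not our exact config). I'm wondering in particular whether `hdfs.callTimeout` is relevant, since the 30000 ms in the trace matches what we have it set to:

```
# Sketch of the relevant agent config (names/values illustrative).
# hdfs.callTimeout is what produces the "Callable timed out after N ms"
# error when an HDFS call stalls; the other properties shown are the
# standard HDFS sink roll/batch settings.
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memChannel
agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
agent.sinks.hdfsSink.hdfs.callTimeout = 30000
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.batchSize = 1000
```

Raising the timeout might paper over slow DataNode writes, but it wouldn't explain why the queues never drain afterwards.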
The other possible issue is that we're running a Hadoop 2.0.5-alpha cluster
with 25 DataNodes. How much (if any) testing has been done against Hadoop
2? I saw the build scripts had a hadoop-2 profile, but I had to modify it
to get the HDFS event sink to build, so I'm not sure of the state of
compatibility or support for it.
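For reference, this is roughly how I invoked the build (the profile/property name here is my reading of the build scripts, so treat it as an assumption rather than a documented option):

```
# Build Flume from source against the Hadoop 2 profile
# (assumes a Flume 1.4 source checkout; profile selection flag
# is my interpretation of the pom, not documented behavior).
mvn clean install -DskipTests -Dhadoop.profile=2
```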
Any help anyone can provide would be appreciated.
============
06 Jul 2013 20:18:27,360 WARN  (org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
java.io.IOException: Callable timed out after 30000 ms on file:
        at java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
        at java.util.concurrent.FutureTask.get(Unknown Source)
        ... 6 more