Flume v1.3.0 had a major performance issue, which is why 1.3.1 was released immediately afterward. The current stable release is 1.4.0, so you should upgrade to that.
1. Can you detail this point? The channel-to-sink path really should not produce exceptions. If the sink, or a plugin the sink uses, is causing rollbacks, then the sink should handle the failure cases, drop events, etc. The channel is essentially a passive component, much like a queue; "bad events" are simply events the sink cannot handle for some reason, and the logic for handling them belongs in the sink itself.
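The sink-side handling described above might look like the following self-contained Java sketch. This is not Flume's actual Sink API; the class and method names are made up for illustration. The point is only the pattern: each event that the downstream system rejects is moved to a dead-letter store instead of being rolled back to the channel and retried forever.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: the sink, not the channel, decides what to do with
// events it cannot deliver. Bad events go to a dead-letter store instead of
// triggering an endless rollback/retry cycle.
public class DeadLetterSinkSketch {
    final Queue<String> channel = new ArrayDeque<>();
    final List<String> delivered = new ArrayList<>();
    final List<String> deadLetters = new ArrayList<>();

    void put(String event) { channel.add(event); }

    // Stand-in for a sink's process() loop.
    void drain() {
        String event;
        while ((event = channel.poll()) != null) {
            try {
                deliver(event);
                delivered.add(event);
            } catch (RuntimeException bad) {
                // A "bad event" the downstream system rejects: shunt it
                // aside so the rest of the batch can still make progress.
                deadLetters.add(event);
            }
        }
    }

    // Simulated downstream delivery that rejects malformed events.
    private void deliver(String event) {
        if (event.isEmpty()) {
            throw new RuntimeException("malformed event");
        }
    }

    public static void main(String[] args) {
        DeadLetterSinkSketch sink = new DeadLetterSinkSketch();
        sink.put("good-1");
        sink.put("");          // malformed
        sink.put("good-2");
        sink.drain();
        System.out.println(sink.delivered.size() + " delivered, "
                + sink.deadLetters.size() + " dead-lettered");
    }
}
```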
2. Currently that is not an option, but if you need it, chances are there are others who do too. Explain your use case in a JIRA. Remember, Flume is not a file-streaming system but an event-streaming one, so each file is still converted into events by Flume.
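For reference, the file-to-events conversion happens in the spooling directory source's deserializer. A minimal config sketch (the agent name and paths here are placeholders, not from this thread; LINE is the default deserializer, producing one event per line, which is why a large file turns into a large number of events):

```
agent1.sources = spool
agent1.channels = fc
agent1.sinks = hdfs-sink

agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /var/flume/incoming
agent1.sources.spool.deserializer = LINE
agent1.sources.spool.channels = fc

agent1.channels.fc.type = file

agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /flume/events
agent1.sinks.hdfs-sink.channel = fc
```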
3. If the current deserializers don't fit your use case, you can easily write your own and drop it in.
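To illustrate what a custom deserializer's core job is, here is a self-contained Java sketch that carves a stream into fixed-size records. Flume's real hook is the EventDeserializer interface (readEvent/readEvents plus a Builder); this plain class is not that interface, it just shows the record-splitting logic you would put behind it. The fixed-size-record format is an invented example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a custom deserializer: split an input stream into events.
public class FixedRecordDeserializer {
    private final InputStream in;
    private final int recordSize;

    public FixedRecordDeserializer(InputStream in, int recordSize) {
        this.in = in;
        this.recordSize = recordSize;
    }

    // Analogue of EventDeserializer.readEvent(): next record, or null at EOF.
    public byte[] readEvent() throws IOException {
        byte[] buf = new byte[recordSize];
        int n = in.read(buf);
        return n <= 0 ? null : Arrays.copyOf(buf, n);
    }

    // Analogue of readEvents(int): a batch of up to n records.
    public List<byte[]> readEvents(int n) throws IOException {
        List<byte[]> events = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            byte[] e = readEvent();
            if (e == null) break;
            events.add(e);
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("abcdefg".getBytes());
        FixedRecordDeserializer d = new FixedRecordDeserializer(in, 3);
        for (byte[] e : d.readEvents(10)) {
            System.out.println(new String(e));
        }
    }
}
```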
On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:
> As I continue to ramp up using Apache Flume (v1.3.0), I have observed a few challenges and am hoping somebody with more experience can shed some light.
> 1. Establishing a data pipeline is trivial; what I have noticed is that any exception thrown during the channel->sink operation triggers what appears to be a repeating cycle of exceptions. As an example, any event that causes an exception (Java stack trace) puts the agent into a tailspin. There are no tools for managing the pipeline: identifying culprit events/files, stopping, purging the channel, introspecting the pipeline, etc. The best course of action is to purge everything under the file channel and restart the agent. I've read several posts suggesting that regex interceptors could be a potential fix, but it is almost impossible to predict, in a production environment, what exceptions are going to occur. In my opinion, there has to be a declarative way to move bad events out of the channel to a "dead-letter queue" or equivalent.
> 2. I was hoping that the Spooling Directory Source would help us capture file metadata, but nothing ever appears in the default .flumespool trackerDir option?
> 3. Maybe my use case is not the right fit for Flume, but my largest design constraint is that we deal with files; everything we do is based on files. I was hoping that the spooldir and batch control options would provide an intuitive way to process files arriving in a spooling directory and ultimately land that same data in HDFS. However, a file with 470,000 lines is creating over 52MM events, and because the tooling is weak, I have no visibility into why that many events are being created or where the agent is with respect to completion. The data flow architecture is perfect, but maybe Flume is best suited for logs, tailing of files, etc., not necessarily processing files?