My issue with ExecSource is the giant warning in the user guide:
"The problem with ExecSource and other asynchronous sources is that the
source can not guarantee that if there is a failure to put the event into
the Channel the client knows about it. In such cases, the data will be
lost. As a for instance, one of the most commonly requested features is the
tail -F [file]-like use case where an application writes to a log file on
disk and Flume tails the file, sending each line as an event. While this is
possible, there’s an obvious problem; what happens if the channel fills up
and Flume can’t send an event? Flume has no way of indicating to the
application writing the log file that it needs to retain the log or that
the event hasn’t been sent, for some reason. If this doesn’t make sense,
you need only know this: Your application can never guarantee data has been
received when using a unidirectional asynchronous interface such as
ExecSource! As an extension of this warning - and to be completely clear -
there is absolutely zero guarantee of event delivery when using this
source. You have been warned."
"zero guarantee of event delivery" is a bit scary for a production system.
:) This is what I'm currently using, and I have to figure out a way to
determine whether events were dropped due to exceptions such as the one noted
above (I'd love to hear some thoughts on this, btw!). AFAIK, this was the
best way to accomplish the tail -F use case. Maybe I'm overly concerned
about this reliability aspect, but after reading that paragraph, it sure
left me with the impression that ExecSource was not the source of choice
for guaranteed delivery.
One of our requirements was to not have to make modifications to every
application that we wanted to get into HDFS, which is why Flume was an
obvious choice! Putting Flume inside the application was not an acceptable
solution given that requirement, unfortunately.
I am not familiar with the asynchronous log spooler. Please point me to
some links! I thought I had investigated all possibilities. :)
I wasn't aware of the inode limitation in Java. That does make things
"difficult" to say the least. For our immediate needs, I'll stick with
ExecSource, but I'll look at doing a client implementation in C or Python
that passes events into an Avro source within the agent.
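
To make that idea concrete, here is a rough sketch of what the tailing half
of such a Python client might look like. It is only an illustration under my
own assumptions, not anything from Flume: the function names, the JSON
checkpoint file, and the send callback are all hypothetical, and send just
stands in for whatever would actually hand the line to the Avro source. It
uses the inode to notice that the log was rotated and a saved offset to
resume after a restart.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved {'inode': ..., 'offset': ...} dict, or None if absent."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def save_checkpoint(path, inode, offset):
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"inode": inode, "offset": offset}, f)
    os.replace(tmp, path)


def read_new_lines(log_path, checkpoint_path, send):
    """Pass each complete new line to send(); return how many were passed.

    send is a hypothetical callback (an assumption of this sketch).
    """
    ckpt = load_checkpoint(checkpoint_path)
    sent = 0
    with open(log_path, "rb") as f:
        st = os.fstat(f.fileno())
        # Resume only if this is the same inode and the file has not shrunk.
        # Otherwise assume rotation or truncation and start from the top.
        if ckpt and ckpt["inode"] == st.st_ino and ckpt["offset"] <= st.st_size:
            f.seek(ckpt["offset"])
        offset = f.tell()
        try:
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file or a partial line; retry next pass
                send(line.rstrip(b"\n"))
                offset = f.tell()  # advance only after send() returned
                sent += 1
        finally:
            # Record progress even if send() raised part-way through.
            save_checkpoint(checkpoint_path, st.st_ino, offset)
    return sent
```

Calling read_new_lines("app.log", "app.ckpt", some_sender) in a loop would
give roughly tail -F behaviour. Known gaps in this sketch: anything still
unread in the old file at rotation time is skipped, and since filesystems can
reuse inode numbers, the inode check is a heuristic rather than a guarantee.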
Thanks so much for everyone's time and comments!
On Thu, Aug 30, 2012 at 12:07 AM, Patrick Wendell <[EMAIL PROTECTED]> wrote:
> Hey Chris,
> I'm not clear what functionality a TailSource could offer that's not
> already offered by (a) using ExecSource, (b) putting Flume inside your
> application, or (c) using the asynchronous log spooler that I am working
> on.
> It's impossible to correctly "watch" a file from within the JVM across
> application restarts. For instance, if the file is renamed, swapped,
> or modified while the JVM is down (as is common with rolling logs),
> there is no way to know whether the old and new file are the same.
> Within the bounds of what *is* possible, I'd say we have the use cases
> pretty much covered, but I'm open to debate if I've missed something.
> - Patrick
> On Wed, Aug 29, 2012 at 6:51 PM, Juhani Connolly
> <[EMAIL PROTECTED]> wrote:
> > Hi Chris,
> > A few months back I actually ported the original Flume's tail source, but
> > it was decided (and I agree with the reasoning) not to include it for a
> > number of reasons, which can be seen on the original ticket at
> > https://issues.apache.org/jira/browse/FLUME-931 . One of the big ones is
> > the fact that Java cannot access inode information.
> > What we do is have a Python program that tracks the files in a directory
> > and then sends the data in the Scribe format to the ScribeSource (we were
> > using Scribe until switching to Flume, so we are just reusing our ingest
> > from then). This allows for the freedom to customize the ingest to our
> > expectations, and we write checkpoints of how far we have tailed. You