Flume >> mail # user >> Architectural questions

Re: Architectural questions
Hi Hari,

Just curious about the performance improvement: can you provide the number
of the JIRA that improves performance in 1.3.1?

On Wed, Aug 14, 2013 at 2:23 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:

>  Flume v1.3.0 had a major performance issue which is why 1.3.1 was
> released immediately after. The current stable release is 1.4.0 - so you
> should use that.
> 1. Can you detail this point? Channel to Sink should really not have any
> exceptions - if the sink or a plugin the sink is using is causing
> rollbacks, then that should handle the failure cases/drop events etc.  The
> channel is pretty much a passive component just like a queue - "bad events"
> are events sinks cannot handle due to some reason. The logic of handling
> this should be in the sink itself.
> 2. Currently that is not an option, but if you need it, chances are there
> are others who do too. Explain your use-case in a jira. Remember, Flume is
> not a file streaming system, it is an event streaming one, so each file is
> still converted into events by Flume.
> 3. If you think the current deserializers don't fit your use-case, you can
> easily write your own and drop it in.
> Thanks,
> Hari
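
Hari's point 1 above (failure handling belongs in the sink, with the channel acting as a passive queue) can be sketched roughly as follows. This is a plain-Java model of the idea, not Flume's Sink or Channel API: the class `DeadLetterSketch`, the method `drain`, and the use of a `Queue<String>` as the "channel" are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Illustrative model only; none of these names come from Flume. */
public class DeadLetterSketch {

    /**
     * Takes every event off the channel and hands it to deliver.
     * An event whose delivery throws is parked in deadLetters instead
     * of being put back, so it cannot be retried in an endless cycle.
     * Returns the number of events delivered successfully.
     */
    public static int drain(Queue<String> channel,
                            Consumer<String> deliver,
                            List<String> deadLetters) {
        int delivered = 0;
        String event;
        while ((event = channel.poll()) != null) {
            try {
                deliver.accept(event);
                delivered++;
            } catch (RuntimeException e) {
                // The sink, not the channel, decides what a "bad event" is.
                deadLetters.add(event);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        Queue<String> channel = new ArrayDeque<>(Arrays.asList("a", "bad", "b"));
        List<String> deadLetters = new ArrayList<>();
        int ok = drain(channel, e -> {
            // Stand-in for a write that fails on one particular event.
            if (e.equals("bad")) {
                throw new IllegalArgumentException("cannot handle " + e);
            }
        }, deadLetters);
        System.out.println(ok + " delivered, dead letters: " + deadLetters);
    }
}
```

In an actual Flume sink the same decision would sit inside `process()`: after diverting the bad event, the sink would commit the channel transaction rather than roll it back, since a rollback returns the event to the channel and it is taken again on the next call.
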
> On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:
> Hello,
> As I continue to ramp up using Apache Flume (v1.3.0), I have observed a
> few challenges and am hoping somebody who has more experience can shed some
> light.
> 1. Establishing a data pipeline is trivial, but I have noticed that
> any exceptions caught from the channel->sink operation invoke what appears
> to be a repeating cycle of exceptions.  As an example, any events which
> cause an exception (java stacktrace) put the agent into a tailspin.  There
> are no tools for managing the pipeline to identify culprit events/files,
> stopping, purging the channel, introspecting the pipeline etc.  The best
> course of action is to purge everything under file-channel and restart the
> agent.  I've read several posts suggesting that using regex interceptors
> could be a potential fix, but it is almost impossible to predict, in a
> production environment, what exceptions are going to occur.  In my opinion,
> there has to be a declarative manner to move bad events out of the channel
> to a "dead-letter-queue" or equivalent.
> 2.  I was hoping that the Spooling Directory Source would help us capture
> file metadata, but nothing ever appears in the default .flumespool
> trackerDir option?
> 3. Maybe my use case is not the right fit for Flume, but my largest design
> constraint is that we deal with files, everything we do is based on files.
> I was hoping that the spooldir and batch control options would provide an
> intuitive way to process files incoming to a spool directory, and ultimately
> land that same data in HDFS.  However, a file with 470,000 lines is
> creating over 52MM events and because the tooling is weak, I have no
> visibility into why that many events are being created or where the agent
> is with respect to completing.  The data flow architecture is perfect, but maybe
> Flume is best used for logs, tailing of files, etc, not necessarily
> processing files?
> Thanks
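
For readers following points 2 and 3 above, a Spooling Directory Source setup along these lines is one way to carry the originating file name with each event and to make the line-to-event mapping explicit. It is a sketch only: the agent and component names (`agent1`, `spool1`, `ch1`) and the directory path are hypothetical, and the property names are the ones listed for this source in the Flume 1.4.0 user guide, so they should be checked against the version actually deployed.

```properties
# Hypothetical names: agent1, spool1, ch1, /var/spool/flume-in
agent1.sources = spool1
agent1.channels = ch1
agent1.channels.ch1.type = file

agent1.sources.spool1.type = spooldir
agent1.sources.spool1.channels = ch1
agent1.sources.spool1.spoolDir = /var/spool/flume-in

# Add a header to every event holding the absolute path of its source file.
agent1.sources.spool1.fileHeader = true
agent1.sources.spool1.fileHeaderKey = file

# LINE deserializer: one event per line. A line longer than maxLineLength
# is truncated and the remainder appears in a subsequent event.
agent1.sources.spool1.deserializer = LINE
agent1.sources.spool1.deserializer.maxLineLength = 2048

# Number of events handed to the channel per transaction.
agent1.sources.spool1.batchSize = 100
```

This does not by itself explain the event count reported in point 3; it only shows which settings govern how a file is split into events.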
*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | [EMAIL PROTECTED]

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com
United States | Canada | United Kingdom | Germany