Re: Architectural questions
Hi Hari,

Just curious about the performance improvement: can you provide the number
of the JIRA that improved performance in 1.3.1?

Thanks,
Pankaj
On Wed, Aug 14, 2013 at 2:23 PM, Hari Shreedharan <[EMAIL PROTECTED]
> wrote:

>  Flume v1.3.0 had a major performance issue which is why 1.3.1 was
> released immediately after. The current stable release is 1.4.0 - so you
> should use that.
>
> 1. Can you detail this point? Channel to Sink should really not have any
> exceptions - if the sink or a plugin the sink is using is causing
> rollbacks, then that sink or plugin should handle the failure cases, drop
> events, etc.  The channel is pretty much a passive component, just like a
> queue - "bad events" are events that sinks cannot handle for some reason.
> The logic of handling this should be in the sink itself.
>
> 2. Currently that is not an option, but if you need it, chances are there
> are others who do too. Explain your use-case in a JIRA. Remember, Flume is
> not a file streaming system, it is an event streaming one, so each file is
> still converted into events by Flume.
>
> 3. If you think the current deserializers don't fit your use-case, you can
> easily write your own and drop it in.
>
>
> Thanks,
> Hari
>
> On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:
>
> Hello,
>
> As I continue to ramp up using Apache Flume (v1.3.0), I have observed a
> few challenges and am hoping somebody with more experience can shed some
> light.
>
> 1. Establishing a data pipeline is trivial, but I have noticed that any
> exception caught from the channel->sink operation invokes what appears
> to be a repeating cycle of exceptions.  As an example, any events which
> cause an exception (Java stack trace) put the agent into a tailspin.  There
> are no tools for managing the pipeline to identify culprit events/files,
> stopping, purging the channel, introspecting the pipeline etc.  The best
> course of action is to purge everything under file-channel and restart the
> agent.  I've read several posts positing that using regex interceptors
> could be a potential fix, but it is almost impossible to predict, in a
> production environment, what exceptions are going to occur.  In my opinion,
> there has to be a declarative manner to move bad events out of the channel
> to a "dead-letter-queue" or equivalent.
> 2.  I was hoping that the Spooling Directory Source would help us capture
> file metadata, but nothing ever appears in the default .flumespool
> trackerDir option?
> 3. Maybe my use case is not the right fit for Flume, but my largest design
> constraint is that we deal with files, everything we do is based on files.
> I was hoping that the spooldir and batch control options would provide an
> intuitive way to process files arriving in a spool directory, and ultimately
> land that same data in HDFS.  However, a file with 470,000 lines is
> creating over 52MM events and, because the tooling is weak, I have no
> visibility into why that many events are being created or how far the agent
> is from completing.  The data flow architecture is perfect, but maybe
> Flume is best used for logs, tailing of files, etc., not necessarily
> for processing files?
>
> Thanks
>
>
>
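
For reference, here is a minimal sketch of the sink-side handling suggested in
point 1 of the quoted reply: the sink retries an event a bounded number of
times and then sets it aside, instead of rolling back forever. It is a
Flume-free model, not Flume's API. The class DeadLetterSinkSketch, the drain
method, the Result type and the maxAttempts parameter are all hypothetical
names made up for illustration, and the channel is modeled as a plain
in-memory queue. In a real sink the take would happen inside a channel
transaction rather than on a Deque.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, Flume-free model of a sink that diverts events it cannot handle. */
public class DeadLetterSinkSketch {

    /** Outcome of draining the channel: delivered events and events set aside. */
    public static final class Result {
        public final List<String> delivered = new ArrayList<>();
        public final List<String> deadLetters = new ArrayList<>();
    }

    /**
     * Processes events in order. A failing event is tried at most maxAttempts
     * times; after that it is moved to the dead-letter list so that it no
     * longer blocks the events queued behind it.
     */
    public static Result drain(Deque<String> channel,
                               Consumer<String> destination,
                               int maxAttempts) {
        Result result = new Result();
        while (!channel.isEmpty()) {
            // Look at the head without removing it (stands in for an open transaction).
            String event = channel.peekFirst();
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                try {
                    destination.accept(event);
                    ok = true;
                } catch (RuntimeException e) {
                    // Stands in for a rollback: the event stays at the head and is retried.
                }
            }
            // Stands in for a commit: the event leaves the channel in both outcomes.
            channel.pollFirst();
            if (ok) {
                result.delivered.add(event);
            } else {
                result.deadLetters.add(event);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Deque<String> channel = new ArrayDeque<>(Arrays.asList("10", "oops", "30"));
        // Stand-in destination that rejects non-numeric payloads.
        Result r = drain(channel, e -> Integer.parseInt(e), 3);
        // Prints: delivered=[10, 30] deadLetters=[oops]
        System.out.println("delivered=" + r.delivered + " deadLetters=" + r.deadLetters);
    }
}
```

The design choice worth noting is that the bad event is removed from the
channel in the same step that records it elsewhere, which is what breaks the
repeating cycle of exceptions. Where the set-aside events should go (a file,
another channel, a log) is left open here.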
--
*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | [EMAIL PROTECTED]

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com