Re: Architectural questions
Hello Pankaj,

All changes for 1.3.1 release (over 1.3.0) are listed on the release
notes page: http://flume.apache.org/releases/1.3.1.html

On Sun, Aug 18, 2013 at 11:20 AM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
> Hi Hari,
> Just curious about the performance improvement: can you provide the number of
> the JIRA that improves performance in 1.3.1?
> Thanks,
> Pankaj
> On Wed, Aug 14, 2013 at 2:23 PM, Hari Shreedharan
> <[EMAIL PROTECTED]> wrote:
>> Flume v1.3.0 had a major performance issue which is why 1.3.1 was released
>> immediately after. The current stable release is 1.4.0 - so you should use
>> that.
>> 1. Can you detail this point? Channel to Sink should really not have any
>> exceptions - if the sink or a plugin the sink is using is causing rollbacks,
>> then that component should handle the failure cases, drop events, etc.  The
>> channel is pretty much a passive component, just like a queue - "bad events"
>> are events that sinks cannot handle for some reason. The logic for handling
>> them should be in the sink itself.
>> 2. Currently that is not an option, but if you need it, chances are there
>> are others who do too. Explain your use-case in a jira. Remember, Flume is
>> not a file streaming system, it is an event streaming one, so each file is
>> still converted into events by Flume.
>> 3. If you think the current deserializers don't fit your use-case, you can
>> easily write your own and drop it in.
>> Thanks,
>> Hari
>> On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:
>> Hello,
>> As I continue to ramp up using Apache Flume (v1.3.0), I have observed a
>> few challenges and am hoping somebody who has more experience can shed some
>> light.
>> 1. Establishing a data pipeline is trivial, but what I have noticed is that
>> any exceptions caught from the channel->sink operation invoke what appears
>> to be a repeating cycle of exceptions.  As an example, any events which
>> cause an exception (java stacktrace) put the agent into a tailspin.  There
>> are no tools for managing the pipeline to identify culprit events/files,
>> stopping, purging the channel, introspecting the pipeline etc.  The best
>> course of action is to purge everything under file-channel and restart the
>> agent.  I've read several posts suggesting that using regex interceptors
>> could be a potential fix, but it is almost impossible to predict, in a
>> production environment, what exceptions are going to occur.  In my opinion,
>> there has to be a declarative manner to move bad events out of the channel
>> to a "dead-letter-queue" or equivalent.
>> 2.  I was hoping that the Spooling Directory Source would help us capture
>> file metadata, but nothing ever appears in the default .flumespool
>> trackerDir option?
>> 3. Maybe my use case is not the right fit for Flume, but my largest design
>> constraint is that we deal with files, everything we do is based on files.
>> I was hoping that the spooldir and batch control options would provide an
>> intuitive way to process files incoming to a spooldirectory, and ultimately
>> land that same data in HDFS.  However, a file with 470,000 lines is creating
>> over 52MM events and because the tooling is weak, I have no visibility into
>> why that many events are being created or how close the agent is to
>> completing.  The data flow architecture is perfect, but maybe Flume is best
>> used for logs, tailing of files, etc, not necessarily processing files?
>> Thanks
> --
> P | (415) 677-9222 ext. 205 F | (415) 677-0895 | [EMAIL PROTECTED]
> Pankaj Gupta | Software Engineer
> BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com
> United States | Canada | United Kingdom | Germany
> We're hiring!

Harsh J
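
For anyone following point 1 of the quoted thread (a failing event sending the agent into a retry loop, and the suggestion that the sink itself should deal with bad events), here is a rough sketch of the idea in plain Python. It does not use any Flume API. The names drain, deliver_to_store, dead_letters and MAX_ATTEMPTS are made up for illustration, and a deque stands in for the channel. It only shows the pattern: retry a bounded number of times, then set the event aside instead of rolling it back into the queue.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget per event


def drain(channel, deliver, dead_letters, max_attempts=MAX_ATTEMPTS):
    """Consume every event in the channel; park events that keep failing."""
    delivered = 0
    while channel:
        event = channel.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                deliver(event)
                delivered += 1
                break
            except Exception as exc:
                # Out of attempts: record the event and the reason, then move on
                # rather than putting the event back on the channel.
                if attempt == max_attempts:
                    dead_letters.append((event, repr(exc)))
    return delivered


def deliver_to_store(event):
    # Stand-in for the real write; rejects events it cannot parse.
    if not event.isdigit():
        raise ValueError("unparseable event")


channel = deque(["1", "2", "oops", "3"])
dead_letters = []
delivered = drain(channel, deliver_to_store, dead_letters)
```

With this shape, one bad event ends up in dead_letters along with the error that rejected it, and the events queued behind it are still delivered. In a real sink the same decision would have to be made inside the sink's own processing and transaction handling, which this sketch does not model.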