Flume, mail # user - Re: New blog post on Flume performance tuning


Mohammad Tariq 2013-01-11, 20:48
Xu 2013-01-11, 20:59
Re: New blog post on Flume performance tuning
Mike Percy 2013-01-11, 22:27
Hi Simon,
There is no good way that I am aware of for Flume to dedup messages. This
is because there is no abstraction for doing pairwise comparison of events,
and, as you scale up, maintaining some kind of hash table of processed
events generally becomes prohibitively expensive, or at least not worth the
effort, at the streaming layer.

The most straightforward way to dedup Flume events is to tag them with some
kind of unique ID at event creation time. Then you can dedup with a
MapReduce job (in the case of writing to HDFS) or by making your operations
idempotent (in the case, for example, of writing keys to HBase).
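
As a rough illustration of the two options above, here is a minimal sketch in
plain Python. It does not use any Flume, MapReduce, or HBase API: the
"event_id" header name, the dict-based event layout, and every function name
are hypothetical, chosen only to show the idea. Note that if two agents each
receive their own copy of a message from a device that cannot tag it, a random
ID assigned per agent would differ between the copies; the content_id helper
shows one possible workaround (an assumption, not something Flume provides),
which is to derive the ID from the message body.

```python
import hashlib
import uuid
from typing import Dict, Iterable, Iterator, Optional


def tag_event(body: bytes, headers: Optional[Dict[str, str]] = None) -> dict:
    """Attach a unique ID at event creation time (hypothetical 'event_id' header)."""
    tagged = dict(headers or {})
    # Keep an ID that was already assigned upstream; otherwise create one.
    tagged.setdefault("event_id", uuid.uuid4().hex)
    return {"headers": tagged, "body": body}


def content_id(body: bytes) -> str:
    """Alternative ID derived from the body, so identical copies share an ID."""
    return hashlib.sha256(body).hexdigest()


def dedup_batch(events: Iterable[dict]) -> Iterator[dict]:
    """Batch-style dedup (stand-in for a MapReduce pass): keep the first event per ID."""
    seen = set()
    for event in events:
        event_id = event["headers"]["event_id"]
        if event_id not in seen:
            seen.add(event_id)
            yield event


def idempotent_put(store: Dict[str, bytes], event: dict) -> None:
    """Keyed write (stand-in for a row-keyed store): replaying an event overwrites itself."""
    store[event["headers"]["event_id"]] = event["body"]
```

For example, delivering the same tagged event twice leaves one entry either
way: dedup_batch drops the second copy after the fact, while idempotent_put
simply overwrites the same key, so no separate cleanup pass is needed.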

Regards,
Mike

On Fri, Jan 11, 2013 at 12:59 PM, Xu (Simon) Chen <[EMAIL PROTECTED]> wrote:

> Great post, Mike!
>
> One question, if you can address it either on the mailing list or in a
> future post...
>
> I am curious about how to remove duplicated messages in this flow. For
> example, when I set up a switch/router to send syslog messages, I'd
> like to send them to two syslog collectors or two flume agents. In this case,
> the switch/router is just a dumb device, not knowing how to fail-over
> or load-balance. As a result, we have two copies of the same message
> going into flume.
>
> I have seen people describing doing hbase operations to remove
> duplicates, but I am wondering if we can do anything in the flume
> infrastructure.
>
> Thanks.
> -Simon
>
> On Fri, Jan 11, 2013 at 3:48 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
> > +1
> >
> > Thank you so much Mike, for all the good work.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> >
> >
> > On Sat, Jan 12, 2013 at 2:15 AM, Mike Percy <[EMAIL PROTECTED]> wrote:
> >>
> >> Thanks Brock! I've been working on this, off and on, for a while. :)
> >>
> >>
> >> On Fri, Jan 11, 2013 at 12:18 PM, Brock Noland <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Nice post!
> >>>
> >>> On Fri, Jan 11, 2013 at 12:13 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
> >>> > Hi folks,
> >>> > I just posted to the Apache blog on how to do performance tuning with
> >>> > Flume.
> >>> > I plan on following it up with a post about using the Flume monitoring
> >>> > capabilities while tuning. Feedback is welcome.
> >>> >
> >>> > https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
> >>> >
> >>> > Regards,
> >>> > Mike
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Apache MRUnit - Unit testing MapReduce -
> >>> http://incubator.apache.org/mrunit/
> >>
> >>
> >
>