Flume >> mail # dev >> Re: [jira] [Created] (FLUME-1479) Multiple Sinks can connect to single Channel


Re: [jira] [Created] (FLUME-1479) Multiple Sinks can connect to single Channel
Hi,
Due to design decisions made very early on in Flume NG - specifically the
fact that Sink only has a simple process() method - I don't see a good way
to get multiple sinks pulling from the same channel in a way that is
backwards-compatible with the current implementation.

Probably the "right" way to support this would be to have an interface
where the SinkRunner (or something outside of each Sink) is in control of
the transaction, and then it can easily send events to each sink serially
or in parallel within a single transaction. I think that is basically what
you are describing. If you look at SourceRunner and SourceProcessor you
will see similar ideas to what you are describing but they are only
implemented at the Source->Channel level. The current SinkProcessor is not
an analog of SourceProcessor, but if it were, I think that's where this
functionality might fit. However what happens when you do that is you have
to handle a ton of failure cases and threading models in a very general
way, which might be tough to get right for all use cases. I'm not 100%
sure, but I think that's why this was not pursued at the time.
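
As a rough illustration of the runner-controlled transaction described above, here is a minimal, self-contained Java sketch. It does not use the real Flume API: `BatchSink`, `SimpleChannel`, and `FanOutRunner` are hypothetical names invented for this example, and the "transaction" is only a take followed by commit or rollback on an in-memory queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration only; none of these exist in Flume.
interface BatchSink {
    // Deliver a batch of events; throw to signal a delivery failure.
    void deliver(List<String> events) throws Exception;
}

class SimpleChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> inFlight = new ArrayList<>();

    void put(String event) { queue.addLast(event); }

    // Take up to max events; they stay "in flight" until commit or rollback.
    List<String> take(int max) {
        while (inFlight.size() < max && !queue.isEmpty()) {
            inFlight.add(queue.pollFirst());
        }
        return new ArrayList<>(inFlight);
    }

    // Drop the in-flight events for good.
    void commit() { inFlight.clear(); }

    // Put in-flight events back at the head, preserving their original order.
    void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            queue.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    int size() { return queue.size(); }
}

class FanOutRunner {
    private final SimpleChannel channel;
    private final List<BatchSink> sinks;

    FanOutRunner(SimpleChannel channel, List<BatchSink> sinks) {
        this.channel = channel;
        this.sinks = sinks;
    }

    // One transaction owned by the runner: every sink must accept the
    // batch, otherwise nothing is removed from the channel.
    boolean processOnce(int batchSize) {
        List<String> batch = channel.take(batchSize);
        if (batch.isEmpty()) {
            return false;
        }
        try {
            for (BatchSink sink : sinks) {
                sink.deliver(batch); // serial delivery, same batch for each sink
            }
            channel.commit();
            return true;
        } catch (Exception e) {
            channel.rollback();
            return false;
        }
    }
}
```

Even this toy version exposes one of the failure cases: if the second sink fails after the first has accepted the batch, the rollback means the first sink sees the same events again on retry.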

To me, this seems like a potential design change (it would have to be very
carefully thought out) to consider for a future major Flume code line
(maybe a Flume 2.x).

By the way, if one is trying to get maximum throughput, then duplicating
events onto multiple channels, and having different threads running the
sinks (the current design) will be faster and more resilient in general
than a single thread and a single channel writing to multiple
sinks/destinations. The multiple-channel design pattern will allow periodic
downtimes or delays on a single sink to not affect the others, assuming the
channel sizes are large enough for buffering during downtime and assuming
that each sink is fast enough to recover from temporary delays. Without a
dedicated buffer per destination, one is at the mercy of the slowest sink
at every stage in the transaction.
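
The multiple-channel pattern described above can be sketched in plain Java as one bounded buffer and one thread per destination. This is an illustration under stated assumptions, not Flume code: `BufferedDestination`, `ReplicatingFanOut`, and the `STOP` sentinel are hypothetical names, and the sleep merely simulates a slow destination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; these classes are not part of Flume.
class BufferedDestination implements Runnable {
    static final String STOP = "__STOP__";   // sentinel marking end of stream
    final BlockingQueue<String> buffer;      // plays the role of a dedicated channel
    final List<String> received = new ArrayList<>();
    private final long delayMillis;

    BufferedDestination(int capacity, long delayMillis) {
        this.buffer = new LinkedBlockingQueue<>(capacity);
        this.delayMillis = delayMillis;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String event = buffer.take();
                if (event.equals(STOP)) {
                    return;
                }
                Thread.sleep(delayMillis);   // simulate a slow or stalled destination
                received.add(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class ReplicatingFanOut {
    // Copy every event into each destination's own buffer, then wait
    // for all destinations to finish draining.
    static void run(List<String> events, List<BufferedDestination> destinations) {
        List<Thread> threads = new ArrayList<>();
        for (BufferedDestination d : destinations) {
            Thread t = new Thread(d);
            t.start();
            threads.add(t);
        }
        try {
            for (String event : events) {
                for (BufferedDestination d : destinations) {
                    d.buffer.put(event);     // blocks only if this buffer is full
                }
            }
            for (BufferedDestination d : destinations) {
                d.buffer.put(BufferedDestination.STOP);
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fanning out", e);
        }
    }
}
```

As long as each buffer has spare capacity, a delay in one destination does not hold up the producer or the other destinations; once a buffer fills, the producer blocks on it, which is why the buffer sizes matter.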

One last thing worth noting is that the current channels are all well
ordered. This means that Flume currently provides a weak ordering guarantee
(across a single hop). That is a helpful property for testing and
validation, and it is what many people expect when they are storing
logs on a single hop. I hope we don't backpedal on that weak ordering
guarantee without a really good reason.

Regards,
Mike

On Fri, Aug 10, 2012 at 9:30 PM, Wang, Yongkun | Yongkun | BDD <
[EMAIL PROTECTED]> wrote:

> Hi Juhani,
>
> Yes, we can use two (or several) channels to fan out data to different
> sinks. Then we will have two channels with the same data, which may not be
> an optimal solution. So I want to use just ONE channel, creating a
> processor to pull the data once from the channel and then distribute it to
> different sinks.
>
> Regards,
> Yongkun Wang
>
> On 12/08/10 18:07, "Juhani Connolly" <[EMAIL PROTECTED]>
> wrote:
>
> >Hi Yongkun,
> >
> >I'm curious why you need to pull the data twice from the channel? Do you
> >need all sinks to have read the same amount of data? Normally for the
> >case of splitting data into batch and analytics, we will send data from
> >the source to two separate channels and have the sinks read from
> >separate channels.
> >
> >On 08/10/2012 02:48 PM, Wang, Yongkun | Yongkun | BDD wrote:
> >> Hi Denny,
> >>
> >> I am working on the patch now; it's not difficult. I have listed the
> >> changes in that JIRA.
> >> I think you misunderstand my design: it does not maintain the order of
> >> the events. Instead, I make sure that each sink gets the same events (or
> >> different events, as specified by a selector).
> >>
> >> Suppose Channel (mc) contains the following events: 4,3,2,1
> >>
> >> If we simply enable it by configuration, it may work like this:
> >> Sink "hsa" may get 1,3;
> >> Sink "hsb" may get 2,4;
> >> So different sinks will get different data. Is this what the user wants?
> >>
> >>
> >> In my design, "hsa" and "hsb" will both get "4,3,2,1". This is a typical