Flume >> mail # user >> Proper documentation for setting up sink groups


Re: Proper documentation for setting up sink groups
Please see inline.

--
Hari Shreedharan
On Thursday, August 23, 2012 at 3:28 PM, Bhaskar V. Karambelkar wrote:

> My replies are inline, and thanks for the detailed explanations.
>
> On Thu, Aug 23, 2012 at 2:57 PM, Hari Shreedharan wrote:
> >
> > Please see inline.
> >
> > --
> > Hari Shreedharan
> >
> >
> > On Thursday, August 23, 2012 at 11:43 AM, Bhaskar V. Karambelkar wrote:
> >
> > > Hi Hari,
> > > Yes, I did read the whole guide end to end,
> > > but I still have doubts.
> > >
> > > The fact that multiple sinks can feed from the same channel is news to me. I don't see it explicitly mentioned in the docs,
> > > so I guess I wrongly assumed that only one sink can feed from a channel.
> > >
> > > a) Can you explain in detail how having multiple sinks taking events from one channel is useful in a "fast source, slow sinks" scenario?
> > When multiple sinks read events from the same channel, you essentially have as many threads taking events out as you have sinks, since each sink has at least one thread. So if your source is dumping n events per second into the channel and each sink can only process 1 event per second, you could have n sinks to read n events per second (this is hypothetical - your hardware and your OS will restrict performance when the number of threads starts growing a lot). A channel returns each event only once, however many sinks are taking from the channel. Once an event is removed and committed, it will never be given to another sink. If there is a rollback, it is as if the event was never taken, and a different sink will be able to take and commit it.
> >
>
>
> OK this makes sense.
>
> > >
> > > b) Also, if I read your explanation below correctly, there are 3 possible cases:
> > >
> > > 1) Multiple sinks feeding from a single channel with the default sink processor: this will be like a multiplexing channel, with all sinks getting all the events that come into the channel.
> > No, every time take() is called on the channel, the channel returns that event to only one sink. So each sink gets unique events (unless rollbacks happen, in which case the events are put back into the channel and a different sink might pick them up).
> >
>
>
> So this situation is exactly like a load-balancing one, as events are somewhat equally distributed among all sinks?
Not necessarily equally distributed. Sinks poll the channel to take events. A sink that is slow in polling the channel will get fewer events, and a faster sink will get more, since the sinks run on different threads.
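
For reference, a minimal sketch of this ungrouped setup, assuming a hypothetical agent a1 with a memory channel c1 and two logger sinks k1 and k2 (the names and sink types are placeholders, and the source definition is left out). Both sinks point at the same channel and no sinkgroups entry is declared, so each sink polls the channel on its own thread:

```properties
# Hypothetical agent "a1": one channel, two independent sinks, no sink group.
a1.channels = c1
a1.sinks = k1 k2

a1.channels.c1.type = memory

# Both sinks take from the same channel; each event goes to only one of them.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c1
```
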
>
> > >
> > > 2) Multiple sinks feeding from a single channel with the failover sink processor: only one sink will get the events at a given time, with Flume failing over to the next available sink in case the first one fails?
> > A sink group essentially treats n sinks like one, and depending on the criteria, will select one sink to process the next event from the channel. In case of failover, sinks are picked in order of priority - and when one sink fails, the next one is picked.
> >
>
>
> OK this makes sense.
>
> > >
> > > 3) multiple sinks feeding from a single channel, with load balancing processor, with all sinks getting events in a round-robin/random order.
> > No, each sink will get a different event. One sink processes one event and the next one picked will process the next event from the channel.
>
>
> Yes, that's exactly what I meant. I didn't imply that all sinks get all events, but that the events are distributed more or less equally among the sinks in round-robin/random order.
> As I said above, this looks almost like #1, except that here you have control over the selection algorithm (round-robin/random).

It is not just that you have control. The distribution will also not depend on each sink's performance, because all sinks in a group are run from the same thread. A slower sink can therefore slow down the whole process, since only one sink reads from the channel at any point in time. Think of a load balancing sink selector as a loop which picks one sink and passes the next event to it. Since there is only one thread per sink group, having one sink group is often slower than having multiple ungrouped sinks reading from the same channel.
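
A sink group for the same two sinks could be sketched roughly as follows. The agent, channel, sink and group names (a1, c1, k1, k2, g1) are placeholders, the sink types are left out, and the priority values are arbitrary. The commented-out lines show how the group might be switched to failover instead:

```properties
# Hypothetical agent "a1": sinks k1 and k2 both read from channel c1.
a1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

# Group the sinks so one processor (and one thread) drives both of them.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Selection algorithm: round_robin or random.
a1.sinkgroups.g1.processor.selector = round_robin

# Failover alternative: the sink with the higher priority value is tried first.
# a1.sinkgroups.g1.processor.type = failover
# a1.sinkgroups.g1.processor.priority.k1 = 10
# a1.sinkgroups.g1.processor.priority.k2 = 5
```

Dropping the sinkgroups lines leaves k1 and k2 as independent sinks on c1, each on its own thread.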