Re: Questions about Batching in Flume
Hey Folks,

So the hole in my thinking was, as Brock pointed out, that the
FileChannel doesn't actually sync() until a commit. I misread the code
while looking at it quickly. So it does allow batching within a
transaction, as desired.
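
For reference, here's roughly what batching within a transaction looks
like from the client side (a minimal sketch against Flume's
Channel/Transaction API; the channel setup and payloads here are
assumptions, not from any real config):

import java.nio.charset.Charset;
import java.util.List;

import org.apache.flume.Channel;
import org.apache.flume.ChannelException;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

public class BatchPutSketch {
  // Put a whole batch inside one transaction. With FileChannel the
  // expensive sync() should then happen once, at commit(), rather
  // than once per put().
  static void putBatch(Channel channel, List<String> payloads) {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      for (String payload : payloads) {
        channel.put(EventBuilder.withBody(payload, Charset.forName("UTF-8")));
      }
      tx.commit(); // single flush/fsync for the whole batch
    } catch (ChannelException e) {
      tx.rollback();
      throw e;
    } finally {
      tx.close();
    }
  }
}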

The JDBC channel, however, looks like it persists events on every
put() rather than at transaction boundaries:

public void put(Event event) throws ChannelException {
  getProvider().persistEvent(getName(), event); // persistEvent() called on every put()
}

Am I wrong on this one as well?
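
If the JDBC channel really does persist per put(), then the
buffer-on-the-heap idea Juhani floats below would look something like
this (purely a hypothetical sketch; nothing like BufferedTransactionSketch
or persistBatch() exists in the JDBC channel today):

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;

// Hypothetical: "buffer puts, persist once per transaction". The
// JDBC channel does NOT do this today; it calls persistEvent() on
// every put().
class BufferedTransactionSketch {
  private final List<Event> pending = new ArrayList<Event>();

  void put(Event event) {
    pending.add(event); // heap only, no I/O yet
  }

  void commit() {
    // Assumption: the provider could persist the whole batch in one
    // round trip / one sync instead of one per event.
    persistBatch(pending);
    pending.clear();
  }

  private void persistBatch(List<Event> events) {
    // a single batched write would go here
  }
}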

- Patrick

On Wed, Jul 11, 2012 at 2:12 AM, Juhani Connolly wrote:
> I think some of my earlier speculation may have led to this
> misunderstanding? I can confirm after changing the exec source that the
> puts/takes themselves are not generating the bottleneck, and that
> performance is fine so long as the number of transactions is not too
> large (as each transaction commit will cause an fsync).
> An option for the channel to store x events on the heap before flushing
> could be interesting, though it would void any delivery guarantees made. I
> do not think this is necessarily a bad thing so long as it is documented (and
> people who want everything committed can request flushing the buffer every
> commit).
> On 07/11/2012 04:05 PM, Brock Noland wrote:
>> What leads you to that conclusion about FC? (I am curious in case there is
>> something I am unaware of.) This is where a Put ends up being written and
>> there is no flush until a commit.
>> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>> Brock
>> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <[EMAIL PROTECTED]>
>> wrote:
>>> Hi All,
>>> Most streaming systems have built-in support for batching since it
>>> often offers major performance benefits in terms of throughput.
>>> I'm a little confused about the state of batching in Flume today. It
>>> looks like a ChannelProcessor can process a batch of events within one
>>> transaction, but internally this just calls Channel.put() several
>>> times.
>>> As far as I can tell, both of the durable channels (JDBC and File)
>>> actually flush to disk in some fashion whenever there is a doPut(). It
>>> seems to me like it makes sense to buffer all of those puts in memory
>>> and only flush them once per transaction. Otherwise, isn't the benefit
>>> of batching put()s within a transaction lost?
>>> I think I might be missing something here, any pointers are appreciated.
>>> - Patrick