Flume dev mailing list: Questions about Batching in Flume


Re: Questions about Batching in Flume
Hey Folks,

So the hole in my thinking was, as Brock pointed out, that the
FileChannel doesn't actually sync() until a commit. I misread the code
while looking at it quickly. So it does allow batching within a
transaction, as desired.
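
Concretely, batching against a channel looks something like this (a
minimal sketch assuming an already-configured FileChannel; the puts are
buffered and the durable sync() should only happen at commit()):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

public final class BatchedPutExample {
  // Put a whole batch inside one transaction: the FileChannel buffers
  // the puts and does its durable sync() once, at commit().
  public static void putBatch(Channel channel, byte[][] payloads) {
    Transaction tx = channel.getTransaction();
    try {
      tx.begin();
      for (byte[] payload : payloads) {
        Event event = EventBuilder.withBody(payload);
        channel.put(event);  // buffered; no fsync here
      }
      tx.commit();           // one sync() covers the whole batch
    } catch (RuntimeException e) {
      tx.rollback();         // discard the uncommitted batch
      throw e;
    } finally {
      tx.close();
    }
  }
}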

The JDBC channel, however, looks like it persists the events on every
put() rather than at transaction boundaries:

@Override
public void put(Event event) throws ChannelException {
  getProvider().persistEvent(getName(), event);
}

Am I wrong on this one as well?
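
If not, the alternative I have in mind would buffer puts and persist
once per transaction, roughly like this (purely a hypothetical sketch:
the Provider interface and persistEvents() method are invented
stand-ins, not the real JdbcChannel internals):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: buffer on put(), persist the batch at
// commit(). Provider and persistEvents() are invented stand-ins.
final class BufferedJdbcPuts {
  interface Provider {
    void persistEvents(String channelName, List<byte[]> events);
  }

  private final Provider provider;
  private final String name;
  private final List<byte[]> buffer = new ArrayList<byte[]>();

  BufferedJdbcPuts(Provider provider, String name) {
    this.provider = provider;
    this.name = name;
  }

  void put(byte[] eventBody) {
    buffer.add(eventBody);  // no database I/O per event
  }

  void commit() {
    provider.persistEvents(name, buffer);  // one batched write per transaction
    buffer.clear();
  }
}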

- Patrick

On Wed, Jul 11, 2012 at 2:12 AM, Juhani Connolly
<[EMAIL PROTECTED]> wrote:
> I think some of my earlier speculation may have led to this
> misunderstanding? I can confirm, after changing the exec source, that the
> puts/takes themselves are not generating the bottleneck, and that
> performance is fine so long as the number of transactions is not too
> large (as each transaction commit will cause an fsync).
>
> An option for the channel to store x events on the heap before flushing
> could be interesting, though it would void any delivery guarantees made
> (a sketch of this idea appears after the thread below). I do not think
> this is necessarily a bad thing so long as it is documented (and people
> who want everything committed can request flushing the buffer on every
> commit).
>
>
> On 07/11/2012 04:05 PM, Brock Noland wrote:
>>
>> What leads you to that conclusion about FC? (I am curious in case there is
>> something I am unaware of.) This is where a Put ends up being written and
>> there is no flush until a commit.
>>
>>
>> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>>
>> Brock
>>
>> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Hi All,
>>>
>>> Most streaming systems have built-in support for batching since it
>>> often offers major performance benefits in terms of throughput.
>>>
>>> I'm a little confused about the state of batching in Flume today. It
>>> looks like a ChannelProcessor can process a batch of events within one
>>> transaction, but internally this just calls Channel.put() several
>>> times.
>>>
>>> As far as I can tell, both of the durable channels (JDBC and File)
>>> actually flush to disk in some fashion whenever there is a doPut(). It
>>> seems to me like it makes sense to buffer all of those puts in memory
>>> and only flush them once per transaction. Otherwise, isn't the benefit
>>> of batching put()s within a transaction lost?
>>>
>>> I think I might be missing something here, any pointers are appreciated.
>>>
>>> - Patrick
>>>
>>
>>
>
>
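
For reference, Juhani's buffer-before-flushing option could be expressed
as an fsync policy along these lines (a hypothetical sketch, not
existing Flume code: RelaxedSyncPolicy and the syncEveryNCommits knob
are inventions, and setting N to 1 recovers today's sync-on-every-commit
behavior):

import java.io.IOException;

// Hypothetical sketch: fsync only every N commits, trading durability
// for throughput. The NIO FileChannel is fully qualified to avoid
// confusion with Flume's own FileChannel class.
final class RelaxedSyncPolicy {
  private final java.nio.channels.FileChannel logFile;  // channel backing the log
  private final int syncEveryNCommits;                  // invented knob; 1 == current behavior
  private int commitsSinceSync = 0;

  RelaxedSyncPolicy(java.nio.channels.FileChannel logFile, int syncEveryNCommits) {
    this.logFile = logFile;
    this.syncEveryNCommits = syncEveryNCommits;
  }

  void onCommit() throws IOException {
    if (++commitsSinceSync >= syncEveryNCommits) {
      logFile.force(false);  // one fsync covers the last N commits
      commitsSinceSync = 0;
    }
  }
}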