Flume >> mail # dev >> Batching events from event driven sources?

Batching events from event driven sources?

Raymond's mail to user@flume "performance on RecoverableMemoryChannel vs
JdbcChannel" got me thinking about how to deal with batching events from
any type of event driven source. Since we don't have control of when the
events arrive, they will generally trickle in one at a time.

I have a couple of half-baked ideas to resolve this:

- Add logic in the source to keep a transaction open over a few
requests. This breaks current conventions (since at the end of the
request, the event hasn't been committed to the channel). Then again,
in SyslogSource there is no way to tell the sender that an event wasn't
inserted properly, so the risk of loss doesn't change much.
- Add a minimum batch size setting to FileChannel, which delays
flushes until either a) a certain delay has passed since the last
flush or b) x events have accumulated.
- Create a BatchingChannel. Basically, configure it with a child
channel; it receives puts and stores them in memory. After a
configured number of events are stored, it puts them to the child
channel. This would again allow for the loss of up to the configured
number of events, but no more.
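
To make the second and third ideas a bit more concrete, here is a rough
sketch of the buffering logic only. It is not Flume code: BatchingBuffer,
put, tick, and the Consumer standing in for the child channel are all
hypothetical names, and it ignores transactions entirely. It also
measures the delay from the oldest buffered event rather than from the
last flush, which is one possible reading of the "delay" condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of count-or-delay batching; not part of Flume.
 * Events are held in memory and handed to a "child" in one batch once
 * either maxEvents are buffered or maxDelayMillis has elapsed since the
 * oldest buffered event arrived.
 */
final class BatchingBuffer<E> {
    private final int maxEvents;
    private final long maxDelayMillis;
    private final Consumer<List<E>> child;   // stands in for the child channel
    private final List<E> pending = new ArrayList<>();
    private long oldestPutMillis;

    BatchingBuffer(int maxEvents, long maxDelayMillis, Consumer<List<E>> child) {
        this.maxEvents = maxEvents;
        this.maxDelayMillis = maxDelayMillis;
        this.child = child;
    }

    /** Buffer one event; the caller supplies the clock to keep this testable. */
    synchronized void put(E event, long nowMillis) {
        if (pending.isEmpty()) {
            oldestPutMillis = nowMillis;
        }
        pending.add(event);
        if (pending.size() >= maxEvents) {
            flush();
        }
    }

    /** Called periodically (e.g. from a timer) to enforce the delay limit. */
    synchronized void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - oldestPutMillis >= maxDelayMillis) {
            flush();
        }
    }

    /** Hand everything buffered so far to the child as a single batch. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<E> batch = new ArrayList<>(pending);
        child.accept(batch);   // if this throws, the events stay buffered
        pending.clear();
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

Anything still sitting in pending when the process dies is lost, which is
the "up to the configured number of events" exposure mentioned above.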

Any other alternatives/ideas? None of the above feel 100% satisfactory
to me, though I think we will have to make some compromise if we want to
allow for decent performance between event driven sources and FileChannel.