Flume, mail # dev - Re: [jira] [Issue Comment Deleted] (FLUME-2173) Exactly once semantics for Flume


Re: [jira] [Issue Comment Deleted] (FLUME-2173) Exactly once semantics for Flume
Gabriel Commeau 2013-08-25, 14:24
Hi Hari,
 

I deleted my comment (again). The mailing list is probably a better avenue
to discuss this - sorry about that! :)

I can find at least one other way duplicate events can occur, so what
I proposed helps reduce duplicate events but is not sufficient to
guarantee exactly-once semantics. However, I still think that using a
two-phase commit when writing to multiple channels would benefit Flume. This
should probably be a separate ticket, though.
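
To make the two-phase commit idea concrete, here is a minimal sketch of the scheme described in the deleted comment quoted further down. It is not Flume's actual API: SketchChannel, AbstractSketchChannel, BoundedChannel and putToAll are hypothetical names made up for illustration, and events are plain strings. Only the preparePut/put split and the always-true default come from the comment itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoPhasePutSketch {

    /** Hypothetical, simplified channel; NOT Flume's real Channel interface. */
    interface SketchChannel {
        boolean preparePut(String event); // voting phase
        void put(String event);           // commit phase
    }

    /** Backward-compatible default: channels that do not override preparePut always vote yes. */
    static abstract class AbstractSketchChannel implements SketchChannel {
        @Override
        public boolean preparePut(String event) {
            return true;
        }
    }

    /** In-memory channel with a fixed capacity, used only for illustration. */
    static class BoundedChannel extends AbstractSketchChannel {
        private final int capacity;
        final List<String> events = new ArrayList<>();

        BoundedChannel(int capacity) {
            this.capacity = capacity;
        }

        @Override
        public boolean preparePut(String event) {
            return events.size() < capacity; // vote no when full
        }

        @Override
        public void put(String event) {
            events.add(event);
        }
    }

    /** Puts the event on every channel only if all of them vote yes. */
    static boolean putToAll(List<? extends SketchChannel> channels, String event) {
        for (SketchChannel c : channels) {
            if (!c.preparePut(event)) {
                return false; // a channel voted no; nothing has been committed yet
            }
        }
        for (SketchChannel c : channels) {
            c.put(event);
        }
        return true;
    }

    public static void main(String[] args) {
        BoundedChannel first = new BoundedChannel(10);
        BoundedChannel second = new BoundedChannel(0); // always votes no
        boolean ok = putToAll(Arrays.asList(first, second), "event-1");
        // prints "committed=false, first channel size=0"
        System.out.println("committed=" + ok + ", first channel size=" + first.events.size());
    }
}
```

Because the second channel votes no, the first channel never receives the event, so a retry cannot duplicate it. The sketch ignores failures during the commit phase itself; a real implementation would need preparePut to reserve whatever put requires.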

Concerning the algorithm you offered, the case of a replicating channel
selector should probably be handled by creating a new UUID for each
replicated copy of the message.
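
As a rough illustration of that point, the sketch below gives each replicated copy its own UUID and filters on the ID at the destination. Everything in it is an assumption made for the example: SketchEvent, the "eventId" header name, replicate and DedupFilter are hypothetical, and an in-memory set stands in for the ZooKeeper znode mentioned in the ticket. One reading of why fresh IDs are needed is that copies sharing an ID could be mistaken for duplicates if they ever reach the same destination.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class ReplicaIdSketch {

    /** Hypothetical header name; the thread does not specify one. */
    static final String ID_HEADER = "eventId";

    /** Minimal stand-in for an event: headers plus a body. */
    static class SketchEvent {
        final Map<String, String> headers = new HashMap<>();
        final String body;

        SketchEvent(String body) {
            this.body = body;
        }
    }

    /** Makes one copy per channel, each stamped with its own fresh UUID. */
    static List<SketchEvent> replicate(SketchEvent original, int channelCount) {
        List<SketchEvent> copies = new ArrayList<>();
        for (int i = 0; i < channelCount; i++) {
            SketchEvent copy = new SketchEvent(original.body);
            copy.headers.putAll(original.headers);
            copy.headers.put(ID_HEADER, UUID.randomUUID().toString());
            copies.add(copy);
        }
        return copies;
    }

    /** Destination-side filter; the in-memory set stands in for ZooKeeper tracking. */
    static class DedupFilter {
        private final Set<String> seen = new HashSet<>();

        /** Returns true the first time an ID is seen, false for a repeat. */
        boolean accept(SketchEvent event) {
            return seen.add(event.headers.get(ID_HEADER));
        }
    }

    public static void main(String[] args) {
        List<SketchEvent> copies = replicate(new SketchEvent("payload"), 2);
        DedupFilter filter = new DedupFilter();
        System.out.println(filter.accept(copies.get(0))); // first copy is accepted
        System.out.println(filter.accept(copies.get(1))); // second copy has a different ID, also accepted
        System.out.println(filter.accept(copies.get(0))); // redelivery of the first copy is rejected
    }
}
```
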
I hope this helps.
Regards,

Gabriel
On 8/25/13 7:27 AM, "Gabriel Commeau (JIRA)" <[EMAIL PROTECTED]> wrote:

>
>     [ https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
>Gabriel Commeau updated FLUME-2173:
>-----------------------------------
>
>    Comment: was deleted
>
>(was: I would approach the problem from a different angle. The way I see
>it, there are two main places where duplicates can occur: when using
>multiple channels for one source (using a replicating channel selector),
>and when the "output" of a sink cannot guarantee whether the event has
>truly been committed or not (as you pointed out, for example, HDFS writing
>the event but throwing an exception).
>Actually, I don't think there is a general solution to the problem of
>output systems (e.g. HDFS) that do not guarantee whether the event is
>truly committed or not, because we'd need to enforce this requirement on
>3rd party systems (relative to Flume). I see it as a problem to be solved
>on a case-by-case basis for each sink.
>
>However, I would like to suggest a solution to the first problem. Here is
>an example to illustrate it: Pretend an agent has a source that writes to
>two (required) channels. As part of a transaction, the channel processor
>will commit to the first channel, which succeeds, and then to the second
>channel, which fails. The whole transaction will fail, but the event has
>already been committed once to the first channel. When the transaction is
>retried, the event will be duplicated.
>The solution I discussed a few months back with Mike P. was to use a
>two-phase commit when writing to channels. This ensures that the events
>are not actually committed to a channel if the following ones fail. This,
>however, will require an API change on the Channel interface. I would
>suggest adding a preparePut method returning a boolean, which would be
>the "voting" phase. The put method becomes the commit phase. To make it
>backward compatible, we'd implement preparePut to always return true in
>the AbstractChannel.
>
>I hope this helps.
>)
>    
>> Exactly once semantics for Flume
>> --------------------------------
>>
>>                 Key: FLUME-2173
>>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>>             Project: Flume
>>          Issue Type: Bug
>>            Reporter: Hari Shreedharan
>>            Assignee: Hari Shreedharan
>>
>> Currently Flume guarantees only at least once semantics. This jira is
>>meant to track exactly once semantics for Flume. My initial idea is to
>>include uuid event ids on events at the original source (use a config to
>>mark a source an original source) and identify destination sinks. At the
>>destination sinks, use a unique ZK Znode to track the events. If once
>>seen (and configured), pull the duplicate out.
>> This might need some refactoring, but my belief is we can do this in a
>>backward compatible way.
>