Kafka >> mail # user >> Transactional writing


Collapsed replies in this thread:
  Tom Brown        2012-10-25, 21:44
  Neha Narkhede    2012-10-26, 01:19
  Philip OToole    2012-10-26, 01:31
  Tom Brown        2012-10-26, 02:04
  Evan Chan        2012-10-26, 02:58
  Jay Kreps        2012-10-26, 18:08
  Guozhang Wang    2012-10-26, 18:31
  Jun Rao          2012-10-29, 03:09
  Jun Rao          2012-10-29, 05:31
  Rohit Prasad     2012-11-02, 22:11
  Tom Brown        2012-11-02, 23:12

Expanded reply:
  Rohit Prasad     2012-11-03, 18:51

Re: Transactional writing
Why wouldn't the Storm approach provide exactly-once delivery
semantics? https://github.com/nathanmarz/storm

Nathan actually credits the Kafka devs for the basic idea of
transaction persistence in one of his talks.

Regards
Milind
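
The mechanism Storm uses to get exactly-once state updates is simple to
sketch: transaction ids are assigned in a strict order, and the state
stores the txid of the last applied transaction next to the value, so a
replayed transaction can be detected and skipped. A minimal Java sketch
of that idempotent-update pattern follows; StateStore is a hypothetical
interface for illustration, not Storm's actual API:

    public class TxIdempotentCounter {
        // StateStore is a hypothetical interface, not Storm's real API.
        // The value and the txid of the last applied transaction must be
        // persisted atomically (e.g. in the same row or file write).
        interface StateStore {
            Long lastAppliedTxId();            // null if nothing applied yet
            long currentCount();
            void write(long count, long txid); // persists both atomically
        }

        private final StateStore store;

        TxIdempotentCounter(StateStore store) { this.store = store; }

        // Called once per transaction, in txid order; safe to call again
        // when a transaction is replayed after a failure.
        void apply(long txid, long increment) {
            Long last = store.lastAppliedTxId();
            if (last != null && last >= txid) {
                return; // already applied; the replay becomes a no-op
            }
            store.write(store.currentCount() + increment, txid);
        }
    }

Because the count and the txid are written atomically and txids are
totally ordered, a crash followed by a replay cannot double-count.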

On Nov 3, 2012 11:51 AM, "Rohit Prasad" <[EMAIL PROTECTED]> wrote:

> I agree that this approach only prevents duplicate messages to a
> partition from the producer side. There needs to be a similar approach
> on the consumer side too. Using ZK can be one solution; there are
> non-ZK approaches as well.
>
> Even if the consumer reads none or all of a transaction's messages,
> that alone does not solve the transaction problem, because the
> business/application logic inside the consumer thread may execute
> partially and then fail. So it becomes tricky to decide the point at
> which you can say you have "consumed" a message and advance the
> consumption offset. If your consumer thread is saving some value into a
> DB/HDFS/etc., ideally you want that save operation and the offset
> increment to happen atomically (sketched below, after the quoted
> thread). That's why it boils down to the application logic implementing
> transactions and dealing with duplicates. Maybe a journaling or
> redo-log approach on the consumer side can help build such a system.
>
> It would be nice if Kafka could eventually be a transport that provides
> "exactly once" message delivery semantics. Then consumer threads could
> be sure that they receive each message once, and could build
> application logic on top of that.
>
> I have a use case similar to what Jay mentioned in a previous mail. I
> want to do aggregation, but I want the aggregated data to be correct,
> possibly avoiding duplicates in case of failures/crashes.
>
>
>
> On Fri, Nov 2, 2012 at 4:12 PM, Tom Brown <[EMAIL PROTECTED]> wrote:
>
> > That approach allows a producer to prevent duplicate messages to the
> > partition, but what about the consumer? In my case, I don't want the
> > consumer to be able to read any of the messages unless it can read all
> > of the messages from a transaction.
> >
> > I also like the idea of there being multiple types of Kafka
> > transactions, though, just to accommodate different performance,
> > reliability, and consumption patterns. Of course, the added
> > complexity of that might just sink the whole thing.
> >
> > --Tom
> >
> > On Fri, Nov 2, 2012 at 4:11 PM, Rohit Prasad <[EMAIL PROTECTED]>
> > wrote:
> > > Getting transactional support right is quite a hard problem. There
> > > will always be corner cases where the solution will not work,
> > > unless you want to go down the path of 2PC, Paxos, etc., which of
> > > course will degrade Kafka's performance. It is best to reconcile
> > > data and deal with duplicate messages in the application layer.
> > > Having said that, it would be amazing if we could build "at most
> > > once" semantics into Kafka!!
> > >
> > > Regarding the above approaches: the producer will always have a
> > > doubt whether its commit went through, i.e. if the ack for the
> > > "commit" is not received, or if the producer dies immediately after
> > > calling commit. When it is restarted, how does it know whether the
> > > last operation went through?
> > >
> > > I suggest the following (sketched below, after the quoted thread):
> > > 1. The producer should attach a timestamp at the beginning of each
> > > message and send it to the server.
> > > 2. On restarts/timeouts/re-connections, the producer should first
> > > read the last committed message from the leader of the partition.
> > > 3. From the timestamp, it can know how many messages went through
> > > before it died (or the connection was broken), and it can infer how
> > > many messages to replay.
> > >
> > > The above approach can be used with the existing Kafka libraries,
> > > since you can have a producer and a consumer thread together in an
> > > application to implement this logic. Or someone can take the
> > > initiative to write a transactional producer (which internally has
> > > both a producer and a consumer to read the last committed message).
> > > I will be developing one for Kafka 0.8 in C++.
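
Two of the ideas above are concrete enough to sketch. First, Rohit's
point that the save operation and the offset increment should happen
atomically: the usual way is to keep the consumer's offset in the same
datastore as the consumed data and commit both in one transaction. A
minimal JDBC sketch, assuming illustrative "messages" and "offsets"
tables (neither the names nor the schema come from the thread):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class AtomicOffsetConsumer {
        // Save the consumed payload and advance the stored offset in ONE
        // DB transaction, so a crash can never leave the data written but
        // the offset behind (a duplicate on restart), or the offset
        // advanced but the data lost (a gap).
        static void saveAtomically(Connection conn, String topic, int part,
                                   long offset, byte[] payload)
                throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement saveMsg = conn.prepareStatement(
                     "INSERT INTO messages (topic, part, off, payload) "
                         + "VALUES (?, ?, ?, ?)");
                 PreparedStatement saveOff = conn.prepareStatement(
                     "UPDATE offsets SET next_off = ? "
                         + "WHERE topic = ? AND part = ?")) {
                saveMsg.setString(1, topic);
                saveMsg.setInt(2, part);
                saveMsg.setLong(3, offset);
                saveMsg.setBytes(4, payload);
                saveMsg.executeUpdate();

                saveOff.setLong(1, offset + 1); // resume point after restart
                saveOff.setString(2, topic);
                saveOff.setInt(3, part);
                saveOff.executeUpdate();

                conn.commit(); // both become visible together, or neither
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

On restart the consumer reads its resume point back from the "offsets"
table instead of from ZooKeeper, so the stored data and the offset can
never disagree.

Second, the three-step producer protocol (tag each message, read back
the last committed message on reconnect, replay only what is missing).
The Log interface below is a hypothetical stand-in, not a real Kafka 0.8
API, and a per-producer sequence number is used where the mail suggests
a timestamp, since two messages can carry the same timestamp:

    import java.util.List;

    public class ReplayingProducer {
        // Hypothetical stand-in for the client API, not a real Kafka call.
        interface Log {
            Long lastCommittedSeqNo();  // seq no carried in the newest
                                        // committed message, or null if
                                        // the partition is empty
            void send(long seqNo, byte[] payload);
        }

        private final Log log;
        private long nextSeqNo;

        ReplayingProducer(Log log) {
            this.log = log;
            Long last = log.lastCommittedSeqNo();
            nextSeqNo = (last == null) ? 0 : last + 1; // step 2: resume point
        }

        // Step 3: resend only the pending messages the log never committed.
        // 'pending' is the producer's durable outbox; entry i carries
        // seq no firstPendingSeqNo + i (step 1: tag before sending).
        void recover(List<byte[]> pending, long firstPendingSeqNo) {
            for (int i = 0; i < pending.size(); i++) {
                long seq = firstPendingSeqNo + i;
                if (seq >= nextSeqNo) {       // never committed: resend
                    log.send(seq, pending.get(i));
                    nextSeqNo = seq + 1;
                }                             // else already committed: skip
            }
        }
    }
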
More collapsed replies:
  Jun Rao          2012-10-26, 04:15
  Tom Brown        2012-10-26, 05:00
  Jun Rao          2012-10-26, 14:24
  Tom Brown        2012-10-26, 14:29
  Jay Kreps        2012-10-26, 18:29
  Jay Kreps        2012-10-26, 18:18
  Jason Rosenberg  2012-10-26, 18:29
  Jay Kreps        2012-10-26, 18:47
  Jonathan Hodges  2013-03-27, 21:42