Re: Transactional writing
For this particular use case, you can potentially include the message
offset from broker A in the message sent to broker B. If the transaction
fails, you read the last message from broker B and use the included offset
to resume the consumer from broker A. This assumes that there is only a
single client writing to broker B (for a particular partition).
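
To make this concrete, here is a minimal sketch of the scheme (illustrative
names only, not a Kafka API), assuming a single writer per partition of
broker B as described:

    import java.nio.charset.StandardCharsets;

    // Each message written to broker B carries the offset in broker A it was
    // derived from. On restart, read the last message from B, pull out the
    // offset, and resume the consumer on A just after it.
    public final class EmbeddedOffset {

        // Prefix the payload with the source offset and a ':' separator.
        static byte[] encode(long offsetFromA, byte[] payload) {
            byte[] prefix = (offsetFromA + ":").getBytes(StandardCharsets.UTF_8);
            byte[] out = new byte[prefix.length + payload.length];
            System.arraycopy(prefix, 0, out, 0, prefix.length);
            System.arraycopy(payload, 0, out, prefix.length, payload.length);
            return out;
        }

        // Recover the source offset from the last message read back from B;
        // the consumer on A then resumes at this offset + 1.
        static long sourceOffset(byte[] message) {
            String s = new String(message, StandardCharsets.UTF_8);
            return Long.parseLong(s.substring(0, s.indexOf(':')));
        }
    }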

Thanks,

Jun

On Fri, Oct 26, 2012 at 11:31 AM, Guozhang Wang <[EMAIL PROTECTED]> wrote:

> I am also quite interested in this thread, and I have another question
> here to ask about committing consumed messages. For example, if I need a
> program which acts both as a consumer and a producer, and the actions are
> wrapped in a "transaction":
>
> Transaction start:
>
>    Get next message from broker A;
>
>    Do something;
>
>    Send a message to broker B;
>
> Commit.
>
>
> If the transaction aborts after reading the message from broker A, is it
> possible to logically "put the message back" to the brokers? I remember that
> Amazon's Simple Queue Service uses some sort of lease mechanism, which might
> work for this case. But I am afraid that would affect throughput a lot...
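
For reference, the loop Guozhang describes looks like the following with the
modern Java clients (which postdate this thread; topic names are
placeholders). Committing the input offset only after the output send
succeeds is not an atomic abort: a crash in between replays the message,
giving at-least-once behavior rather than a true transaction:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ConsumeProcessProduce {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "consume-process-produce");
            props.put("enable.auto.commit", "false"); // commit manually, after produce
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                consumer.subscribe(List.of("topic-a"));            // "broker A"
                while (true) {
                    for (ConsumerRecord<String, String> rec :
                            consumer.poll(Duration.ofSeconds(1))) {
                        String result = rec.value().toUpperCase(); // "do something"
                        producer.send(new ProducerRecord<>("topic-b", result));
                    }
                    producer.flush();      // ensure sends to "broker B" landed
                    consumer.commitSync(); // the "commit": a crash before this
                                           // point replays, it does not roll back
                }
            }
        }
    }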
>
> --Guozhang
>
>
> -----Original Message-----
> From: Jay Kreps [mailto:[EMAIL PROTECTED]]
> Sent: Friday, October 26, 2012 2:08 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Transactional writing
>
> This is an important feature and I am interested in helping out in the
> design and implementation, though I am working on 0.8 features for the next
> month so I may not be of too much use. I have thought a little bit about
> this, but I am not yet sure of the best approach.
>
> Here is a specific use case I think is important to address: consider a
> case where you are doing processing of one or more streams and producing an
> output stream. This processing may involve some kind of local state (say
> counters or other local aggregation intermediate state). This is a common
> scenario. The problem is to give reasonable semantics to this computation
> in the presence of failures. The processor effectively has a
> position/offset in each of its input streams as well as whatever local
> state. The problem is that if this process fails it needs to restore to a
> state that matches the last produced messages. There are several solutions
> to this problem. One is to make the output somehow idempotent; this will
> solve some cases but is not a general solution, as many things cannot
> easily be made idempotent.
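
As a toy illustration of that idempotence point (not from the original
message): output emitted as an absolute value under a deterministic key
survives replay, while output emitted as a delta does not:

    import java.util.HashMap;
    import java.util.Map;

    // Replaying input after a failure is harmless if each output is an
    // absolute value keyed deterministically (an upsert); it double-counts
    // if each output is a "+1" delta.
    public class IdempotentOutput {
        private final Map<String, Long> counts = new HashMap<>();
        private final Map<String, Long> sink = new HashMap<>(); // stands in for the output stream

        void process(String word) {
            long total = counts.merge(word, 1L, Long::sum); // local aggregation state
            sink.put(word, total); // idempotent: a replay overwrites with the same value
            // sink.merge(word, 1L, Long::sum) would NOT be idempotent: a
            // replayed message would be counted twice.
        }
    }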
>
> I think the two proposals you give outline a couple of basic approaches:
> 1. Store the messages on the server somewhere but don't add them to the log
> until the commit call
> 2. Store the messages in the log but don't make them available to the
> consumer until the commit call
> Another option you didn't mention:
>
> I can give several subtleties to these approaches.
>
> One advantage of the second approach is that the messages are already in
> the log, whether or not they are available for reading. This makes it
> possible to support a kind of "dirty read" that lets the consumer choose
> between seeing all messages immediately (low latency, but potentially
> including uncommitted messages) and seeing only committed messages.
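
For context, this is essentially the choice Kafka eventually shipped (in
0.11, long after this thread) as the consumer's isolation.level setting; a
minimal sketch:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommittedOnlyReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "committed-only");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            // "read_uncommitted" (the default) is the low-latency dirty read;
            // "read_committed" hides messages from open or aborted transactions.
            props.put("isolation.level", "read_committed");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("topic-b"));
                consumer.poll(Duration.ofSeconds(1)); // returns committed data only
            }
        }
    }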
>
> The problem with the second approach, at least as you describe it, is that
> you have to lock the log until the commit occurs; otherwise you can't roll
> back, because someone else may have appended their own messages and you
> can't truncate the log. This would have all the problems of remote locks. I
> think this might be a deal-breaker.
>
> Another variation on the second approach would be the following: have each
> producer maintain an id and generation number. Keep a schedule of valid
> offset/id/generation numbers on the broker and only hand these out. This
> solution would support non-blocking multi-writer appends but requires more
> participation from the producer (i.e. getting a generation number and id).
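
A hypothetical sketch of the broker-side bookkeeping this implies (all names
invented here, not a Kafka API; Kafka's later idempotent producer uses a
similar producer id + epoch scheme):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // The broker tracks the latest generation per producer id and rejects
    // appends from older generations, fencing off zombie writers without
    // holding a lock on the log.
    public class GenerationFence {
        private final Map<Long, Integer> latestGeneration = new ConcurrentHashMap<>();

        // Called when a producer (re)registers; bumping the generation
        // invalidates any still-running instance with the same id.
        int register(long producerId) {
            return latestGeneration.merge(producerId, 1, (old, one) -> old + 1);
        }

        // Called on every append: only the current generation may write.
        boolean tryAppend(long producerId, int generation, byte[] message) {
            Integer current = latestGeneration.get(producerId);
            if (current == null || current != generation) {
                return false; // stale or unknown producer: reject, nothing to roll back
            }
            // append message to the log here
            return true;
        }
    }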
>
> Cheers,
>
> -Jay
>
> On Thu, Oct 25, 2012 at 7:04 PM, Tom Brown <[EMAIL PROTECTED]> wrote: