Kafka >> mail # user >> Replication questions

Re: Replication questions
Hmm... interesting!

So, if I understand correctly, what you're saying regarding point 2 is
that the messages are going to be kept in memory on several nodes, and
will start being served to consumers as soon as this is completed, rather than
after the data is flushed to disk? This way, we still benefit from the
throughput gain of flushing data to disk in batches, but we consider that
the added durability of having in-memory replication is good enough to
start serving that data to consumers sooner.
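This trade-off can be sketched with a toy model (plain Python, not Kafka code; all latency figures below are made-up assumptions for illustration):

```python
import random

# Assumed, illustrative latencies -- not measurements of any real system.
REPLICA_RTT_MS = 0.5   # in-memory replication round-trip to one follower
FLUSH_MS = 10.0        # forcing a flush on a traditional (seeking) disk

def visible_after_replication(num_followers: int) -> float:
    """Policy A: serve the message once every follower holds it in memory.
    Followers are contacted in parallel, so we pay the slowest round-trip."""
    rtts = [REPLICA_RTT_MS * random.uniform(0.8, 1.2)
            for _ in range(num_followers)]
    return max(rtts)

def visible_after_flush() -> float:
    """Policy B: serve the message only after it is flushed to disk."""
    return FLUSH_MS

print(visible_after_replication(2), visible_after_flush())
```

Under these (assumed) numbers, serving after parallel in-memory replication is an order of magnitude sooner than waiting for the batched disk flush.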

Furthermore, this means that in the unlikely event that several nodes
fail simultaneously (a correlated failure), the data that is replicated to
the failed nodes but not yet flushed on any of them would be lost. However,
when a single node crashes and is then restarted, only the failed node will
have lost its unflushed data, while the other nodes that had replicated
that data will have had the opportunity to flush it to disk later on.
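The two failure scenarios above can be mocked up in a few lines (assumed semantics, not Kafka internals: each broker keeps a flushed set that survives a crash and an unflushed in-memory set that does not):

```python
class Broker:
    """Toy broker: a crash wipes only the in-memory (unflushed) data."""
    def __init__(self):
        self.flushed = set()    # on disk, survives a crash
        self.unflushed = set()  # in memory, lost on a crash

    def replicate(self, msg):
        self.unflushed.add(msg)

    def flush(self):
        self.flushed |= self.unflushed
        self.unflushed.clear()

    def crash_and_restart(self):
        self.unflushed.clear()

def surviving_copies(brokers, msg):
    return sum(msg in (b.flushed | b.unflushed) for b in brokers)

# Single-node crash: the two surviving replicas can still flush m1 later.
brokers = [Broker() for _ in range(3)]
for b in brokers:
    b.replicate("m1")
brokers[0].crash_and_restart()
for b in brokers[1:]:
    b.flush()

# Correlated crash of all replicas before any flush: m2 is lost everywhere.
correlated = [Broker() for _ in range(3)]
for b in correlated:
    b.replicate("m2")
for b in correlated:
    b.crash_and_restart()
```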

Sorry if I'm repeating like a parrot. I just want to make sure I understand
correctly :)

Please correct me if I'm not interpreting this correctly!
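For reference, the acks trade-off discussed in the quoted messages below can also be sketched as a toy latency model (plain Python; the round-trip numbers are assumptions, and the acks semantics are deliberately simplified):

```python
# Assumed round-trip times, purely for illustration.
LEADER_RTT_MS = 0.5    # producer -> leader -> producer
FOLLOWER_RTT_MS = 0.5  # leader -> follower -> leader (done in parallel)

def producer_ack_latency(acks: int) -> float:
    """Simplified model of what the producer waits for before continuing."""
    if acks == 0:
        return 0.0                # fire-and-forget: no wait at all
    if acks == 1:
        return LEADER_RTT_MS      # wait for the leader's ack only
    # acks == -1 (or > 1): also wait for in-sync followers; they are
    # contacted in parallel, so one extra follower round-trip is added.
    return LEADER_RTT_MS + FOLLOWER_RTT_MS

print(producer_ack_latency(0), producer_ack_latency(1),
      producer_ack_latency(-1))
```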


On Mon, Apr 30, 2012 at 5:59 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> Yes, it is also worth noting that there are a couple of different ways
> to think about latency:
> 1. latency of the request from the producer's point-of-view
> 2. end-to-end latency to the consumer
> As Jun mentions, (1) may go up a little because the producer was
> sending data without checking for any answer from the server. Although
> this gives a nice buffering effect it leads to a number of corner
> cases that are hard to deal with correctly. It should be the case that
> setting the producer to async has the same effect from the producer's
> point of view without the corner cases of having no RPC response to
> convey errors and other broker misbehavior.
> (2) may actually get significantly better, especially for lower-volume
> topics. The reason is that currently we wait until data is flushed to
> disk before giving it to the consumer; this flush policy is controlled by
> setting a number of messages or a timeout at which the flush is forced.
> The reason to configure this is that on traditional disks each flush is
> likely to incur at least one seek. In
> the new model replication can take the place of waiting on a disk
> flush to provide durability (even if the log of the local server loses
> unflushed data, as long as all servers don't crash at the same time no
> messages will be lost). Doing 2 parallel replication round-trips
> (perhaps surprisingly) looks like it may be a lot lower-latency than
> doing a local disk flush (< 1ms versus >= 10ms). In our own usage, the
> desire for this kind of low-latency consumption is not common, but I
> understand that this is a common need for messaging.
> -Jay
> On Thu, Apr 26, 2012 at 2:03 PM, Felix GV <[EMAIL PROTECTED]> wrote:
> > Thanks Jun :)
> >
> > --
> > Felix
> >
> >
> >
> > On Thu, Apr 26, 2012 at 3:26 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
> >
> >> Some comments inlined below.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Thu, Apr 26, 2012 at 10:27 AM, Felix GV <[EMAIL PROTECTED]> wrote:
> >>
> >> > Cool :) Thanks for those insights :) !
> >> >
> >> > I changed the subject of the thread, in order not to derail the
> original
> >> > thread's subject...! I just want to recap to make sure I (and others)
> >> > understand all of this correctly :)
> >> >
> >> > So, if I understand correctly, with acks == [0,1] Kafka should
> provide a
> >> > latency that is similar to what we have now, but with the possibility
> of
> >> > losing a small window of unreplicated events in the case of an
> >> > unrecoverable hardware failure, and with acks > 1 (or acks == -1)
> there
> >> > will probably be a latency penalty but we will be completely protected
> >> from
> >> > (non-correlated) hardware failures, right?
> >> >
> >> > This is mostly true. The difference is that in 0.7, the producer doesn't
> wait
> >> for a TCP response from the broker. In 0.8, the producer always waits for a