Kafka, mail # dev - Re: get offsets?


Re: get offsets?
Jeffrey Damick 2011-09-22, 21:11
But why search when you could track and index it directly?

FWIW, we considered doing what someone suggested below as well, but for
many of the same reasons listed below, plus the added complexity of making
it fault tolerant with tiered brokers, we unfortunately had to abandon
going further with Kafka.
On Thu, Sep 22, 2011 at 4:55 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:

> >> To Neha's comment, sounds like you'd need some kind of table to maintain
> >> the list of offsets and which segment they live in?
>
> Not really. What I was suggesting is maintaining a table per log
> segment, keyed by offset that maintains a mapping from offset to
> num-messages-since-start-of-log-segment.
> This would allow a binary search to look for the closest offset to the
> nth message in the log segment.
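
A minimal sketch of the per-segment table described above, for illustration only. The class name `SegmentIndex`, its methods, and the sample byte offsets are hypothetical and are not part of Kafka; the sketch assumes a sparse table with one entry every 100 messages and uses a binary search to find the closest indexed offset at or before the nth message.

```python
from bisect import bisect_right


class SegmentIndex:
    """Hypothetical per-segment table: byte offset -> messages since segment start."""

    def __init__(self):
        self._offsets = []  # byte offsets within the segment, ascending
        self._counts = []   # number of messages that precede each offset

    def add(self, offset, messages_before):
        # Entries are appended in log order, so both lists stay sorted.
        self._offsets.append(offset)
        self._counts.append(messages_before)

    def closest_offset(self, n):
        """Return (offset, count) for the last entry whose count is <= n."""
        i = bisect_right(self._counts, n) - 1
        if i < 0:
            raise KeyError("message %d precedes the first index entry" % n)
        return self._offsets[i], self._counts[i]


index = SegmentIndex()
# Made-up byte positions; one index entry per 100 messages.
for k, off in enumerate([0, 5120, 10310, 15800]):
    index.add(off, k * 100)

# Closest indexed position at or before message 250 of this segment.
offset, count = index.closest_offset(250)
```

From the returned offset a reader would still have to scan forward `n - count` messages, so the table density trades memory for scan length.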
>
> >> for every write of a message to topic foo, write the new offset into
> >> topic foo_i.
>
> If I understand the use case correctly, what you want is, at some
> point in the consumption for topic 'foo', you want to "go back n
> messages".
>
> I can see a number of problems with the approach mentioned above,
>
> 1. Writing offsets to topic foo_i per message in foo will be
> expensive. There is no batching on the producer side. Also, there is
> no way to ensure that each message in foo and its corresponding
> message in foo_i are flushed to disk atomically on the broker. There
> could be a window of error there if the broker crashes.
> 2. Garbage collection for topics foo and foo_i would have to be in lock
> step.
> 3. Depending on how frequently you need to "go back n messages", it
> could lead to random disk IO for topic foo_i.
> 4. This solution would lead to an extra topic per real topic, since
> foo_i will not be able to encode a topic name in its message, or else
> messages would be variable sized.
> 5. Even if we assume you had the data correctly written to both
> topics, while consuming topic 'foo', you'd have to keep track of how
> many messages you have consumed, to be able to define an offset for
> topic foo_i. That means, in addition to a consumed offset, you'd have
> to keep track of number of messages consumed for that offset.
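
To make points 4 and 5 concrete, here is a rough sketch of the foo_i idea, assuming each foo_i entry is a fixed-size 8-byte record holding the offset of one message in foo. The names `append_entry` and `go_back`, the in-memory `bytearray` standing in for the foo_i log, and the sample offsets are all hypothetical. It shows why the consumer has to carry a message count next to its consumed offset: the count is what locates the entry in foo_i.

```python
import struct

# One fixed-size entry per message in foo: an 8-byte big-endian offset.
ENTRY = struct.Struct(">q")


def append_entry(index_log, foo_offset):
    """Record the offset of the next message written to foo."""
    index_log.extend(ENTRY.pack(foo_offset))


def go_back(index_log, messages_consumed, n):
    """Return the foo offset to re-read from after going back n messages.

    messages_consumed is the extra counter the consumer must maintain;
    entry k (0-based) lives at byte k * ENTRY.size in the index log.
    """
    target = messages_consumed - n
    if target < 0:
        raise ValueError("cannot go back past the start of the topic")
    (offset,) = ENTRY.unpack_from(index_log, target * ENTRY.size)
    return offset


foo_i = bytearray()
for off in [0, 40, 95, 130, 210]:  # made-up offsets of five messages in foo
    append_entry(foo_i, off)

# After consuming 5 messages, going back 2 means re-reading from message 3.
rewind_to = go_back(foo_i, messages_consumed=5, n=2)
```

Because every entry has the same size, the lookup is a single positioned read, which is also why a variable-length field such as a topic name cannot be stored in the entry.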
>
>
> I think this information could be maintained by the broker hosting the
> topic partition in a more consistent manner.
>
> Thanks,
> Neha
>
> On Thu, Sep 22, 2011 at 9:49 AM, Taylor Gautier <[EMAIL PROTECTED]> wrote:
> >
> > I'm not sure this is the same use case - here you just need to remember
> > the last good offset, which is only ever written by the last consumer.  If
> > things fail you just come back from the last good offset.
> >
> > There are many ways to store the last known good offset - in memory, in a
> > filesystem, in memcache, in the db, or in zk.  Using the simpleconsumer
> > here doesn't work as it's intended to make the simple thing work simply,
> > but not necessarily act as building blocks for more complex use cases, afaik.
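
As one illustration of the "filesystem" option above, here is a small sketch of a last-known-good offset store. `FileOffsetStore` and its methods are hypothetical names, not a Kafka API. It assumes only the last consumer in the chain calls `commit`, and it writes through a temporary file followed by a rename so that a crash does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile


class FileOffsetStore:
    """Hypothetical checkpoint file holding the last good offset per partition."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def commit(self, topic, partition, offset):
        state = self._load()
        state["%s-%d" % (topic, partition)] = offset
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old file so readers see either the old or new state.
        os.replace(tmp, self.path)

    def last_good(self, topic, partition, default=0):
        return self._load().get("%s-%d" % (topic, partition), default)


store = FileOffsetStore(os.path.join(tempfile.mkdtemp(), "offsets.json"))
store.commit("foo", 0, 4096)           # written by the last consumer only
resume_from = store.last_good("foo", 0)  # where to restart after a failure
```

The same two operations could be backed by memcache, a database, or ZooKeeper; only the storage behind `commit` and `last_good` would change.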
> >
> > On Thu, Sep 22, 2011 at 9:41 AM, Evan Chan <[EMAIL PROTECTED]> wrote:
> >
> > > Hi everyone,
> > >
> > > > I'd like to add a somewhat similar use case, for going back to a
> > > > specific offset. (Maybe this will be addressed with the time indexing
> > > > thing in Kafka 87? By the way, are any of the upcoming features
> > > > documented?)
> > >
> > > > Let's say I want to design a fault tolerant system around Kafka's
> > > > ability to replay messages from a specific offset.
> > > > A chain of consumers reads messages from Kafka, and the last one dumps
> > > > data into some database.
> > > > What I want to achieve is, if any of the consumers fails, then a system
> > > > detects the failure, and replays messages from a specific offset.
> > >
> > > How this can be achieved:
> > > > 1) Instead of having the consumer reading from Kafka update ZK with the
> > > > latest offset, I have the _last_ node in the consumer chain, the one
> > > > that writes to the DB, update ZK with an offset.
> > > 2) A monitoring system detects node failures along the entire consumer