|
|
-
Re: Question about offsetsDavid Arthur 2013-01-30, 13:54
Thanks, Jay. Makes sense
+1 for documentation. I think in general the wiki is a bit unorganized, makes it hard to know what is the "current" info vs old proposals. -David On 1/30/13 12:01 AM, Jay Kreps wrote: > Yes, offsets are now logical 0,1,2,3... > > Some details on this change are here: > https://cwiki.apache.org/confluence/display/KAFKA/Keyed+Messages+Proposal > https://issues.apache.org/jira/browse/KAFKA-506 > https://issues.apache.org/jira/browse/KAFKA-631 > > There was some previous threads on this that discussed the benefits. The > main goals included: > 1. Allow key-based retention policy in addition to time-based retention. > This allows the topic to retain the latest value for each message key for > data streams that contain updates. > 2. Fix the semantics of compressed messages (which are weird in 0.7 because > the compressed messages essentially have no offset and are unaddressable > hence commit() in the middle of a compressed message set will give > duplicates!). > 3. Allow odd iteration patterns (skipping ahead N messages, reading > messages backwards, etc) > > It is also just nicer for humans and less prone to weird errors. > > To answer your questions: > > 1. Lookups are done using a memory mapped index file kept with the log that > maps logical offsets to byte offsets. This index is sparse (need not > contain every message). Once the begin and end position in the log segment > are found the data transfer happens the same as before. > > 2. The rule is that all messages have offsets stored with the messages. > These are assigned by the server. So in your example, the offsets of B1, B2 > will be sequentially assigned and the wrapper message will have the offset > of the last message in the compressed set. > > As a result in 0.8 requesting any offset in the compressed message set will > yield that compressed messageset (you need the whole thing to decompress). > This means the java client now works correctly with commit() and compressed > message sets--if you commit in the middle of a compressed message set and > restart from that point you will read the compressed set again, throw away > the messages with offset lower than your consumers offset, and then start > consuming each message after that. > > Hope that makes sense. > > We really need to get this stuff documented since it is all floating around > on various wikis and JIRAs. > > -Jay > > > On Tue, Jan 29, 2013 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote: > >> Fiddling with my Python client on 0.8, noticed something has changed with >> offsets. >> >> It seems that instead of a byte offset in the log file, the offset is now >> a logical one. I had a few questions about this: >> >> 1) How is the byte offset determined by the broker? Since messages are not >> fixed width, does it use an index or do a simple binary search? >> 2) Regarding compressed MessageSets, how are the offsets incremented? >> Suppose I have the following MessageSet >> >> MessageSet A >> - Message A1, normal message >> - Message A2, normal message >> - Message A3, compressed MessageSet B >> - MessageSet B >> - Message B1 >> - Message B2 >> >> Assuming we start from 0, message A1 gets offset of 0, A2 of 1. Now I am >> unclear how the numbering goes. Would A3 get offset 3, and B1 -> 4 B2 -> 5? >> Or are the offsets inside a compressed MessageSet not used? >> >> Is it possible to request a message inside a compressed message set? >> >> Also what about nested compression sets, what if Message B3 is itself a >> compressed MessageSet (not that it makes sense, just curious what would >> happen). >> >> Thanks! >> -David >> >> >> >> |