-
A few questions from a new user
graham sanderson 2012-07-07, 17:18
1) I would like to guarantee that a group of messages is always delivered together in its entirety (because some messages carry contextual information that later messages depend on). I'm a little confused by the term "nested message sets", since I don't see much about it in the code (though I don't really know Scala); perhaps it refers to the fact that you can have a set of messages within a message set file on disk. Anyway, I was curious (I'm using the Java API now, but may move to the Scala one later) what I need to do to guarantee that N messages are sent and delivered as a single message set. Is a single ProducerData with a List of messages always sent as a single message set? Does compression need to be turned on? How does this affect network limits etc. (i.e. does the entire message set have to fit)? I'm also assuming that once I have a message set containing all my messages, it will be discarded in its entirety.
2) Related to 1), from the consumer side: can I tell the boundaries of a message set? That may not be required for me, but I do want to make sure I receive the entire set in one go (again, do I have to set network limits accordingly?). The docs say that the entire message set is always delivered to the client when compressed, but I'm not sure whether it can be subdivided if not compressed. Note that I'm happy to stick with compression if required.
3) I'm using the ZookeeperConsumerConnector, since I don't want to manage finding the brokers myself; however, I was wondering if there are any plans to decouple the consumer offset tracking from it. One of my use cases is that I'll have a lot of ad-hoc, one-off consumers that simply read a subset of data until they die. From looking at ConsoleConsumer, there is currently a hack that simply deletes the ZooKeeper info after the fact to get around this.
Thanks,
Graham.
-
Re: A few questions from a new user
Niek Sanders 2012-07-07, 20:30
Graham,
If you have a collection of data that should always be sent and consumed together and in-order, why not send it using a single Kafka message? Or is the payload really huge?
- Niek
-
Re: A few questions from a new user
Graham Sanderson 2012-07-07, 23:37
Thanks Niek, good point. I can certainly do that and limit the payload arbitrarily, based on size (I just need to include my "contextual" information again in the next message), or based on latency requirements (which I was planning to do for the message set too).
I guess I thought that message sets might be a built-in concept that already did what I wanted.
Since many of my payloads may be smaller than the message header anyway, this will probably save me a bunch of space too.
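As an illustration of the "several records in a single Kafka message" idea discussed above, here is a minimal sketch of one way to frame records inside one payload. It is not part of any Kafka API: the function names `pack_records` and `unpack_records` and the 4-byte big-endian length prefix are assumptions chosen for the example, and the resulting bytes would simply be sent as the body of one ordinary message.

```python
import struct

def pack_records(records):
    """Concatenate records, each preceded by a 4-byte big-endian length (hypothetical framing)."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

def unpack_records(payload):
    """Split a payload produced by pack_records back into its records."""
    records = []
    pos = 0
    while pos < len(payload):
        if pos + 4 > len(payload):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", payload, pos)
        pos += 4
        if pos + length > len(payload):
            raise ValueError("truncated record")
        records.append(payload[pos:pos + length])
        pos += length
    return records

# The contextual record travels first, followed by the records that depend on it.
context = b"session=42"
batch = pack_records([context, b"event-a", b"event-b"])
assert unpack_records(batch) == [context, b"event-a", b"event-b"]
```

Because the whole batch is one message, it is delivered or dropped as a unit, and the per-message header is paid once rather than once per small record.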
-
Re: A few questions from a new user
Jay Kreps 2012-07-09, 16:13
1,2. We don't really provide guarantees outside the message boundary. You are right that in the current implementation the nested message sets we use for compression do actually fulfill this purpose. Our intention with these was really to allow batch compression, and they don't really have any advantage over putting multiple records in a single Kafka message. Likewise, our delivery guarantees are at the message boundary.
3. We would be interested in decoupling the offset tracking and also in adding some facility to garbage collect unused groups (e.g. if no update to a group for > 1 week, delete it). This feature doesn't exist though. If anyone wants to add it, it would probably be pretty easy--just a background task in the broker that periodically scanned the groups in zk.
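The garbage-collection idea in point 3 could look roughly like the sketch below. This is only an illustration of the "no update for more than a week, delete it" rule, not existing Kafka code: the in-memory dictionary stands in for whatever last-update information would be read from ZooKeeper, and `find_stale_groups`, `collect_garbage`, and the `delete_group` callback are hypothetical names.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60

def find_stale_groups(last_update_by_group, now, max_idle_seconds=ONE_WEEK_SECONDS):
    """Return the groups whose last update is older than max_idle_seconds."""
    return sorted(
        group
        for group, last_update in last_update_by_group.items()
        if now - last_update > max_idle_seconds
    )

def collect_garbage(last_update_by_group, now, delete_group,
                    max_idle_seconds=ONE_WEEK_SECONDS):
    """Call delete_group for every stale group and return the deleted names."""
    stale = find_stale_groups(last_update_by_group, now, max_idle_seconds)
    for group in stale:
        delete_group(group)  # hypothetical hook; a real task would remove the group's zk data
    return stale

# Example with made-up timestamps (seconds): one group idle for 8 days, one for 1 hour.
now = 1_000_000_000
groups = {"adhoc-reader": now - 8 * 24 * 3600, "hdfs-loader": now - 3600}
deleted = []
collect_garbage(groups, now, deleted.append)
assert deleted == ["adhoc-reader"]
```

A periodic background task would call `collect_garbage` on a timer; the threshold is a parameter so short-lived ad-hoc consumers could be cleaned up sooner.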
-Jay
-
Re: A few questions from a new user
graham sanderson 2012-07-09, 17:51
Thanks Jay,
Yes, w.r.t. 3, I later saw the client redesign doc, which does mention not tracking/balancing consumers for a topic, along with a possible non-blocking consumer.
+1 for all of those
One particular use case of interest to me (along with the standard data bus for loading into, say, HDFS) involves many parts of our system watching the flow of traffic over certain topics and partitions (which they can figure out based on what they are looking for). Particularly in cases where we want to learn something statistical (estimates) from the data in real time, playback from the past is one of the handiest features of Kafka: if we spin up a new box, it can simply play back the last N hours of data. We are also (or will be) heavily into filtering, with small int values (GCed/reused over time) providing a lookup into ZooKeeper for metadata including source machine, application, message encoding (e.g. Avro), schema, and other metadata. This allows us to do simple bit set tests on the message header (and indeed on messages full of other messages) to trivially reject messages that don't interest us (over and above having rejected topics we don't care about). Sometimes we will need to listen on all partitions of a topic, and potentially on multiple topics, so the non-blocking consumer feature would be nice, but it is lower priority since we can easily enough listen on multiple threads.
We also plan to feed low-level and potentially high-volume instrumentation/log events, timed periods etc. into Kafka (though not with the intention of it actually being persisted after the fact). Again, playback is wonderful for seeing what happened this morning. Here filtering can be very efficient: if you're looking for sessionId="foo", you can skip any messages that don't have sessionId in their schema and then, based on the pluggable encoding, do 1) a byte[]-level grep for "foo" and 2) if that passes, a more semantic grep. Note that with Avro being our choice, and the schema defined by the first word, we avoid repetitious encoding of field names. Also, knowing the schema, we can do more interesting filters like price < 10000. Anyway, the only reason I mention it (and we're at the very early stages) is that we may want to find a way to push this filtering away from our clients (to save them downloading and then immediately discarding messages), whether that means pushing it into Kafka in some way or simply into a set of services that share a fast network with the Kafka cluster.
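A rough sketch of the bit set test described above, under stated assumptions: each small int id becomes one bit, a consumer keeps a bit set of the ids it cares about, and a message header (or a container message holding other messages) carries the bit set of the ids it contains. The names `make_bitset` and `wants` and the header layout are hypothetical, not anything Kafka defines.

```python
def make_bitset(ids):
    """Build an integer bit set with one bit set per small non-negative id."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

def wants(interest_bits, message_bits):
    """True if the message carries at least one id the consumer is interested in."""
    return (interest_bits & message_bits) != 0

# Consumer cares about ids 3 and 17; the ids here are made up for the example.
interest = make_bitset([3, 17])

single = make_bitset([5])           # header of a message with id 5
container = make_bitset([5, 17])    # header of a message full of other messages

assert not wants(interest, single)  # rejected without looking at the payload
assert wants(interest, container)   # at least one contained message is relevant
```

For a container message the header bits would be the OR of the ids inside it, so a single AND decides whether the payload needs to be opened at all.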
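The staged filter described above might be sketched as follows. This is an assumption-laden illustration: JSON stands in for the real encoding so the example stays self-contained (the thread's choice is Avro), the schema is reduced to a set of field names, and `matches` is a hypothetical helper. The byte-level stage is only a valid prefilter when the encoding stores string values as their raw bytes.

```python
import json

def matches(raw, schema_fields, field, value, decode=json.loads):
    """Staged check for field == value on an encoded message.

    Stage 0: skip messages whose schema has no such field.
    Stage 1: cheap byte-level grep for the value.
    Stage 2: decode and compare the actual field (the "semantic grep").
    """
    if field not in schema_fields:
        return False
    if value.encode("utf-8") not in raw:
        return False
    record = decode(raw)
    return record.get(field) == value

session_schema = {"sessionId", "note"}
hit = b'{"sessionId": "foo", "note": "login"}'
near_miss = b'{"sessionId": "bar", "note": "foo"}'  # passes the byte grep, fails stage 2

assert matches(hit, session_schema, "sessionId", "foo")
assert not matches(near_miss, session_schema, "sessionId", "foo")
assert not matches(hit, {"price"}, "sessionId", "foo")  # rejected on schema alone
```

Only messages that survive the two cheap stages are decoded, which is where most of the cost of a filter like price < 10000 would otherwise go.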
Thanks,
Graham