-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jeffrey Damick 2011-11-11, 18:55
So with regard to KAFKA-187 <https://issues.apache.org/jira/browse/KAFKA-187>, what is the stance going to be on supporting new compression methods? Is it expected that all clients 'must' & will support them? If not, is there a set of 'required' compression codecs? Jun mentioned not wanting every language to re-implement a thick client, but where is the line between thick and thin? It seems like there needs to be a clear set of expectations for what a client implements, regardless of language or platform, or maybe I'm off in the weeds..
-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Chris Burroughs 2011-11-11, 19:22
I think realistically, if we try to say that we can only include compression codecs that every client language supports, our only codec will be gzip (or maybe bzip2, but that's ill suited for most use cases).
-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jeffrey Damick 2011-11-12, 19:24
Right, but on the other hand, if every compression under the sun is allowed, then you could end up with a very fractured client community of support.

I guess I'd like to see a client RFC of sorts, but maybe I'm the only one that cares about alternative language support... :)
-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jay Kreps 2011-11-12, 21:05
Hi Jeffrey,

What you are saying makes sense. I agree that we need to give a client spec which is language agnostic. Currently I think we have reasonable support for non-Java producers--they are easy to write and work just as well as Java. We do not have good support for non-Java consumers because the co-ordination algorithm is done client side, which makes the consumer implementation complex. This is discussed a little here:
https://issues.apache.org/jira/browse/KAFKA-167

I think with regard to compression we don't want to support gobs of compression algorithms, but we do want to give a few basic options. We discussed this a lot when we were originally designing Kafka; here was the thinking. Compression can be done in a couple of ways. It could be internal to the message and purely a contract between the producer and consumer, or it could be something handled only on the broker, with messages compressed by the broker and decompressed when fetched. Here is what we came up with:

1. We want end-to-end compression. That is, the compression should be carried through the producer network hop, should be written compressed to disk, and should be fetched without needing decompression.
2. We want compression to be explicitly supported in the message/log format to enable "block" compression that compresses batches of messages. The reason for this is that this is much more effective than single-message compression, especially for a stream where all messages share common fields. This is very common for many use cases.

This means that compression does need to be something the client is aware of. For the codecs to support, we discussed this as well. We have only a single byte for the compression codec, which means we can't support an unbounded number of codecs, and the support is in Kafka and is not meant to be user-pluggable. The reason for this is that we didn't feel that plugging in all possible algorithms really added any value. Instead we wanted to support a couple of useful CPU vs size trade-offs:

1. No Compression: This requires the least CPU (maybe) and has the largest data size.
2. GZIP: This has pretty good size but is very CPU intensive. This is appropriate for a lot of LinkedIn's uses, where data is being transferred between datacenters and production comes from a very large number of producer processes, and hence data size is much more important than CPU usage.
3. LZO or Snappy are a nice intermediate between these extremes--good but not great compression with low CPU usage. We had thought of doing LZO, but Snappy seems to be slightly better.

At this point I don't see much use in adding additional compression types since there aren't many more useful spots on the CPU/size tradeoff curve.

Because of the style of implementation, each compression type does require support from both the producer and the consumers in each language. However, lacking a compression type in one language is not a big impediment. If a given language doesn't support it, users of that client can just not use that compression type.

My understanding is that Snappy is available as fairly portable C, so it should be reasonable to embed in most common languages.

Does that sound reasonable?

-Jay
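To make the "block" compression point in Jay's message concrete, here is a minimal Python sketch comparing per-message gzip with gzip applied once to a whole batch of messages that share common fields. Everything in it is an assumption for illustration only: the codec ids (`CODEC_NONE`, `CODEC_GZIP`), the 4-byte length-prefix framing, the sample message fields, and the helper names are hypothetical and are not Kafka's actual wire format or client API. The only details taken from the thread are that the codec is identified by a single byte and that batches are compressed together.

```python
import gzip
import json

# Hypothetical codec ids. The thread only says the codec fits in a single
# byte; it does not say which values Kafka assigns.
CODEC_NONE = 0
CODEC_GZIP = 1


def make_messages(n):
    """Build n small JSON messages that all share the same field names."""
    return [
        json.dumps({"event": "page_view", "member_id": i, "page": "/home"}).encode("utf-8")
        for i in range(n)
    ]


def compress_each(messages):
    """Per-message compression: gzip every message on its own."""
    return [bytes([CODEC_GZIP]) + gzip.compress(m) for m in messages]


def compress_block(messages):
    """Block compression: length-prefix each message, then gzip the batch once.

    The 4-byte big-endian length prefix is an illustrative framing choice.
    """
    batch = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return bytes([CODEC_GZIP]) + gzip.compress(batch)


def decompress_block(payload):
    """Reverse compress_block: check the codec byte, gunzip, split on prefixes."""
    codec, body = payload[0], payload[1:]
    if codec != CODEC_GZIP:
        raise ValueError(f"unsupported codec id {codec}")
    batch = gzip.decompress(body)
    messages, pos = [], 0
    while pos < len(batch):
        size = int.from_bytes(batch[pos:pos + 4], "big")
        pos += 4
        messages.append(batch[pos:pos + size])
        pos += size
    return messages


messages = make_messages(1000)
raw_size = sum(len(m) for m in messages)
per_message_size = sum(len(c) for c in compress_each(messages))
block = compress_block(messages)
block_size = len(block)

print("raw bytes:        ", raw_size)
print("per-message gzip: ", per_message_size)
print("block gzip:       ", block_size)
```

For small messages like these, the fixed gzip header and trailer are paid once per message in the per-message case, and the compressor never sees the repetition across messages, so the batch variant comes out far smaller. A consumer that does not recognize the codec byte can only reject the payload, which is why each codec needs support on both the producer and consumer side in every client language.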
-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jeffrey Damick 2011-11-13, 18:58
Yes, I don't disagree with the need for or feasibility of gzip and Snappy, as we're both agreeing a client spec is really what is lacking. How can I help? I would think even just documenting the protocol on the wiki would be a good start (that would have helped me on the Go client).
-
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jun Rao 2011-11-14, 00:57
Jeffrey,

There is already a wiki on Kafka compression. Feel free to extend it.

Thanks,

Jun