Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Jeffrey Damick 2011-11-13, 18:58
Yes, I don't disagree with the need for or feasibility of gzip and snappy; as
we're both agreeing, a client spec is really what is lacking. How can I help? I
would think even just documenting the protocol on the wiki would be a good
start (that would have helped me on the Go client).
On Sat, Nov 12, 2011 at 4:05 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> Hi Jeffrey,
> What you are saying makes sense. I agree that we need to give a client spec
> which is language agnostic. Currently I think we have reasonable support
> for non-java producers--they are easy to write and work just as well as
> java. We do not have good support for non-java consumers because the
> co-ordination algorithm is done client side, which makes the consumer
> implementation complex. This is discussed a little here:
> I think with regard to compression we don't want to support gobs of
> compression algorithms, but we do want to give a few basic options. We
> discussed this a lot when we were originally designing Kafka; here was the
> thinking. Compression can be done in a couple of ways. It could be internal
> to the message and purely a contract between the producer and consumer, or
> it could be something handled only on the broker, with messages compressed
> by the broker and decompressed when fetched. Here is what we came up with:
> 1. We want end-to-end compression. That is, the compression should be
> carried through the producer network hop, should be written compressed to
> disk, and should be fetched without needing decompression.
> 2. We want compression to be explicitly supported in the message/log
> format to enable "block" compression that compresses batches of messages.
> The reason for this is that block compression is much more effective than
> per-message compression, especially for a stream where all messages share
> common fields. This is very common for many use cases; see the sketch below.
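> To make (2) concrete, here is a minimal sketch (a hypothetical helper, not
> the actual Kafka code) of block compression: length-prefix each message in
> a batch and GZIP the whole batch into a single wrapper payload, which the
> broker can then store and serve without ever decompressing it:
>
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.util.List;
>     import java.util.zip.GZIPOutputStream;
>
>     public class MessageBatchCompressor {
>         /** Length-prefix each message and GZIP the batch into one payload. */
>         public static byte[] compressBatch(List<byte[]> messages) throws IOException {
>             ByteArrayOutputStream buffer = new ByteArrayOutputStream();
>             try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
>                 for (byte[] message : messages) {
>                     // 4-byte big-endian length prefix, then the message bytes
>                     gzip.write(new byte[] {
>                         (byte) (message.length >>> 24),
>                         (byte) (message.length >>> 16),
>                         (byte) (message.length >>> 8),
>                         (byte) message.length
>                     });
>                     gzip.write(message);
>                 }
>             }
>             return buffer.toByteArray();
>         }
>     }
>
> Because all messages in the batch share one compression context, repeated
> field names and common prefixes compress far better than they would
> message-by-message.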
> This means that compression does need to be something the client is aware
> of. As for which codecs to support, we discussed this as well. We have only
> a single byte for the compression codec, which means we can't support an
> unbounded number of codecs; the support is built into Kafka and is not meant
> to be user-pluggable. The reason for this is that we didn't feel that
> plugging in all possible algorithms really added any value. Instead we
> wanted to support a couple of useful CPU-vs-size trade-offs:
> 1. No Compression: This requires the least CPU (maybe) and has the
> largest data size.
> 2. GZIP: This has pretty good size but is very CPU intensive. This is
> appropriate for a lot of LinkedIn's uses, where data is being transferred
> between datacenters and production comes from a very large number of
> producer processes, and hence data size is much more important than CPU.
> 3. LZO or Snappy: These are a nice intermediate between these
> extremes--good but not great compression with low CPU usage. We had thought
> of doing LZO, but Snappy seems to be slightly better.
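> As a sketch of how the single codec byte might map onto these options (the
> ids here are illustrative; the authoritative values live in the Kafka
> source):
>
>     public enum CompressionCodec {
>         NONE((byte) 0),
>         GZIP((byte) 1),
>         SNAPPY((byte) 2);
>
>         public final byte id;
>
>         CompressionCodec(byte id) {
>             this.id = id;
>         }
>
>         public static CompressionCodec fromId(byte id) {
>             for (CompressionCodec codec : values()) {
>                 if (codec.id == id) return codec;
>             }
>             throw new IllegalArgumentException("Unknown codec id: " + id);
>         }
>     }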
> At this point I don't see much use in adding additional compression types
> since there aren't many more useful spots on the CPU/size tradeoff curve.
> Because of the style of implementation, each compression type does require
> support from both the producer and the consumers in each language. However,
> lacking a compression type in one language is not a big impediment: if a
> given language doesn't support it, users of that client can just not use
> that compression type.
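> Consumer-side, that just means dispatching on the codec byte and failing
> loudly for anything the client hasn't implemented. A hypothetical sketch
> (Snappy deliberately left unimplemented here):
>
>     import java.io.ByteArrayInputStream;
>     import java.io.ByteArrayOutputStream;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.util.zip.GZIPInputStream;
>
>     public class MessageDecompressor {
>         public static byte[] decompress(byte codecId, byte[] payload) throws IOException {
>             switch (codecId) {
>                 case 0: // no compression
>                     return payload;
>                 case 1: // GZIP
>                     return readAll(new GZIPInputStream(new ByteArrayInputStream(payload)));
>                 default: // e.g. Snappy -- not implemented in this sketch
>                     throw new UnsupportedOperationException(
>                         "Codec " + codecId + " not supported by this client");
>             }
>         }
>
>         private static byte[] readAll(InputStream in) throws IOException {
>             ByteArrayOutputStream out = new ByteArrayOutputStream();
>             byte[] chunk = new byte[4096];
>             int n;
>             while ((n = in.read(chunk)) != -1) out.write(chunk, 0, n);
>             return out.toByteArray();
>         }
>     }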
> My understanding is that snappy is available as fairly portable C, so it
> should be reasonable to embed in most common languages.
> Does that sound reasonable?
> On Sat, Nov 12, 2011 at 11:24 AM, Jeffrey Damick <[EMAIL PROTECTED]> wrote:
> > Right, but on the other hand, if every compression under the sun is
> > supported, then you could end up with a very fractured client community of
> > support. I guess I'd like to see a client RFC of sorts, but maybe I'm the
> > only one.