
Jeffrey Damick 2011-11-11, 18:55
Chris Burroughs 2011-11-11, 19:22
Jeffrey Damick 2011-11-12, 19:24
Jay Kreps 2011-11-12, 21:05
Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec
Yes, I don't disagree with the need for or feasibility of gzip and snappy;
we both agree that a client spec is really what is lacking.  How can I help?
I would think even just documenting the protocol on the wiki would be a good
start (that would have helped me on the Go client).
On Sat, Nov 12, 2011 at 4:05 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> Hi Jeffrey,
> What you are saying makes sense. I agree that we need to give a client spec
> which is language agnostic. Currently I think we have reasonable support
> for non-java producers--they are easy to write and work just as well as
> java. We do not have good support for non-java consumers because the
> co-ordination algorithm is done client side which makes the consumer
> implementation complex. This is discussed a little here:
> https://issues.apache.org/jira/browse/KAFKA-167
> I think with regard to compression we don't want to support gobs of
> compression algorithms, but we do want to give a few basic options. We
> discussed this a lot when we were originally designing Kafka; here was the
> thinking. Compression can be done in a couple of ways. It could be internal to
> the message and purely a contract between the producer and consumer or it
> could be something handled only on the broker with messages compressed by
> the broker and decompressed when fetched. Here is what we came up with:
>   1. We want end-to-end compression. That is, the compression should be
>   carried through the producer network hop, should be written compressed to
>   disk, and should be fetched without needing decompression.
>   2. We want compression to be explicitly supported in the message/log
>   format to enable "block" compression that compresses batches of messages.
>   The reason for this is that this is much more effective than
>   single-message compression, especially for a stream where all messages
>   share common fields. This is very common for many use cases.
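
[Editor's illustration] The block-compression argument in point 2 can be
checked with a minimal Python sketch. Everything here is an assumption made
for illustration: the JSON event fields, the 4-byte length prefix, and the
helper names are hypothetical and are not Kafka's actual message format. The
sketch compresses each message on its own, then compresses the same messages
as one length-prefixed batch, and prints the three sizes.

```python
import gzip
import json


def make_messages(n):
    # Hypothetical event records that all share the same field names.
    return [
        json.dumps({"event": "page_view", "member_id": i, "page": "/home"}).encode("utf-8")
        for i in range(n)
    ]


def per_message_size(messages):
    # Compress every message independently and sum the compressed sizes.
    return sum(len(gzip.compress(m)) for m in messages)


def batch_size(messages):
    # Length-prefix each message (4 bytes, big-endian; an assumed framing),
    # then compress the whole batch as a single block.
    block = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return len(gzip.compress(block))


messages = make_messages(200)
raw = sum(len(m) for m in messages)
single = per_message_size(messages)
batched = batch_size(messages)
print("raw bytes:", raw)
print("per-message gzip bytes:", single)
print("batched gzip bytes:", batched)
```

With small messages the per-message variant pays the gzip header and trailer
on every record and cannot exploit repetition across records, which is why
the batched size should come out well below it.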
> This means that compression does need to be something the client is aware
> of. For the codecs to support, we discussed this as well. We have only a
> single byte for the compression codec, which means we can't support an
> unbounded number of codecs and the support is in Kafka and is not meant to
> be user-pluggable. The reason for this is that we didn't feel that plugging
> in all possible algorithms really added any value. Instead we wanted to
> support a couple of useful CPU vs size trade-offs:
>   1. No Compression: This requires the least CPU (maybe) and has the
>   largest data size.
>   2. GZIP: This has pretty good size but is very CPU intensive. This is
>   appropriate for a lot of LinkedIn's uses where data is being transferred
>   between datacenters and production comes from a very large number of
>   producer processes and hence data size is much more important than CPU
>   usage.
>   3. LZO or Snappy are a nice intermediate between these extremes--good
>   but not great compression with low CPU usage. We had thought of doing
>   LZO, but snappy seems to be slightly better.
> At this point I don't see much use in adding additional compression types
> since there aren't many more useful spots on the CPU/size tradeoff curve.
> Because of the style of implementation, each compression type does require
> support from both the producer and the consumers in each language. However,
> lacking a compression type in one language is not a big impediment. If a
> given language doesn't support it, users of that client can just not use
> that compression type.
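
[Editor's illustration] A rough Python sketch of how a client might handle
the single codec byte described above, assuming a hypothetical frame layout
of one codec byte followed by the (possibly compressed) payload. The codec
ids, the frame layout, and the function names are assumptions for
illustration, not the Kafka wire format. Snappy is deliberately left out of
the table to show a client that lacks one codec: it still works with the
codecs it has and rejects the one it does not.

```python
import gzip

# Hypothetical codec ids; the thread only says the codec fits in one byte.
NONE, GZIP, SNAPPY = 0, 1, 2

# Codecs this imaginary client supports, as (compress, decompress) pairs.
# Snappy is intentionally missing.
CODECS = {
    NONE: (lambda data: data, lambda data: data),
    GZIP: (gzip.compress, gzip.decompress),
}


def encode(codec, payload):
    # Build a frame: one codec byte followed by the encoded payload.
    if codec not in CODECS:
        raise ValueError(f"codec {codec} is not supported by this client")
    compress, _ = CODECS[codec]
    return bytes([codec]) + compress(payload)


def decode(frame):
    # Read the codec byte, then decode the rest of the frame accordingly.
    codec, body = frame[0], frame[1:]
    if codec not in CODECS:
        raise ValueError(f"codec {codec} is not supported by this client")
    _, decompress = CODECS[codec]
    return decompress(body)


frame = encode(GZIP, b"hello kafka")
print(decode(frame))
```

Calling encode(SNAPPY, ...) on this client raises ValueError, so its users
would simply pick NONE or GZIP instead.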
> My understanding is that snappy is available as fairly portable C, so it
> should be reasonable to embed in most common languages.
> Does that sound reasonable?
> -Jay
> On Sat, Nov 12, 2011 at 11:24 AM, Jeffrey Damick <[EMAIL PROTECTED]> wrote:
> > Right, but on the other hand if every compression under the sun is allowed,
> > then you could end up with a very fractured client community of support.
> >
> > I guess I'd like to see a client RFC of sorts, but maybe I'm the only one
Jun Rao 2011-11-14, 00:57