Kafka dev mailing list

Re: [jira] [Updated] (KAFKA-187) Add Snappy Compression as a Codec and refactor CompressionUtil and option on startup to select what the default codec


Earlier messages in this thread (collapsed):
  Jeffrey Damick   2011-11-11, 18:55
  Chris Burroughs  2011-11-11, 19:22
  Jeffrey Damick   2011-11-12, 19:24
  Jay Kreps        2011-11-12, 21:05
Jeffrey Damick replied:
Yes, I don't disagree with the need for or feasibility of gzip and snappy, as
we both agree that a client spec is really what is lacking.  How can I help?
I would think even just documenting the protocol on the wiki would be a good
start (that would have helped me on the Go client).
On Sat, Nov 12, 2011 at 4:05 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> Hi Jeffrey,
>
> What you are saying makes sense. I agree that we need to give a client spec
> which is language agnostic. Currently I think we have reasonable support
> for non-Java producers--they are easy to write and work just as well as
> Java. We do not have good support for non-Java consumers because the
> co-ordination algorithm is done client-side, which makes the consumer
> implementation complex. This is discussed a little here:
> https://issues.apache.org/jira/browse/KAFKA-167
>
> I think with regard to compression we don't want to support gobs of
> compression algorithms, but we do want to give a few basic options. We
> discussed this a lot when we were originally designing Kafka; here was the
> thinking. Compression can be done in a couple of ways. It could be internal
> to the message and purely a contract between the producer and consumer, or
> it could be something handled only on the broker, with messages compressed
> by the broker and decompressed when fetched. Here is what we came up with:
>
>   1. We want end-to-end compression. That is, the compression should be
>   carried through the producer network hop, should be written compressed to
>   disk, and should be fetched without needing decompression.
>   2. We want compression to be explicitly supported in the message/log
>   format to enable "block" compression that compresses batches of messages.
>   The reason is that this is much more effective than single-message
>   compression, especially for a stream where all messages share common
>   fields. This is very common for many use cases.
>
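To make the "block" compression idea in point 2 concrete, here is a minimal
sketch of the producer side: messages are length-prefixed, concatenated, and
compressed as one unit, so fields repeated across messages compress together.
The framing and the choice of gzip are assumptions for illustration, not
Kafka's actual wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    public class BlockCompressor {
        // Compress a whole batch into one payload. The broker can store and
        // serve these bytes as-is, which is the end-to-end property above.
        static byte[] compressBatch(List<byte[]> messages) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out =
                     new DataOutputStream(new GZIPOutputStream(buf))) {
                for (byte[] m : messages) {
                    out.writeInt(m.length); // length-prefixed framing (assumed)
                    out.write(m);
                }
            }
            return buf.toByteArray();
        }
    }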
> This means that compression does need to be something the client is aware
> of. As for which codecs to support, we discussed this as well. We have only
> a single byte for the compression codec, which means we can't support an
> unbounded number of codecs, and the support is built into Kafka rather than
> being user-pluggable. The reason for this is that we didn't feel that
> plugging in all possible algorithms really added any value. Instead we
> wanted to support a couple of useful CPU vs. size trade-offs:
>
>   1. No Compression: This requires the least CPU (maybe) and has the
>   largest data size.
>   2. GZIP: This has pretty good size but is very CPU-intensive. This is
>   appropriate for a lot of LinkedIn's uses, where data is being transferred
>   between datacenters and production comes from a very large number of
>   producer processes, and hence data size is much more important than CPU
>   usage.
>   3. LZO or Snappy: A nice intermediate between these extremes--good but
>   not great compression with low CPU usage. We had thought of doing LZO,
>   but Snappy seems to be slightly better.
>
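Regarding the single codec byte mentioned above, a client-side table for it
could be as simple as the following. The enum is illustrative rather than
Kafka source, though the id assignments shown (0 = none, 1 = gzip,
2 = snappy) do match what Kafka eventually shipped.

    public enum CompressionCodec {
        NONE(0), GZIP(1), SNAPPY(2);

        public final int id;

        CompressionCodec(int id) { this.id = id; }

        // Decode the attribute byte from an incoming message.
        public static CompressionCodec fromId(int id) {
            for (CompressionCodec c : values())
                if (c.id == id) return c;
            throw new IllegalArgumentException("unknown codec id: " + id);
        }
    }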
> At this point I don't see much use in adding additional compression types
> since there aren't many more useful spots on the CPU/size tradeoff curve.
>
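The CPU-vs-size trade-off is easy to observe directly. A rough sketch,
assuming the JDK's built-in gzip plus the third-party snappy-java library
(org.xerial.snappy) for Snappy; the message shape and count are made up for
illustration:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;
    import org.xerial.snappy.Snappy;

    public class CodecTradeoff {
        public static void main(String[] args) throws IOException {
            // A batch of log-like messages with repeated fields: the case
            // where block compression shines.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++)
                sb.append("{\"host\":\"app-42\",\"level\":\"INFO\",\"seq\":")
                  .append(i).append("}\n");
            byte[] batch = sb.toString().getBytes(StandardCharsets.UTF_8);

            long t0 = System.nanoTime();
            ByteArrayOutputStream gzipped = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(gzipped)) {
                gz.write(batch);
            }
            long gzipNs = System.nanoTime() - t0;

            t0 = System.nanoTime();
            byte[] snappied = Snappy.compress(batch);
            long snappyNs = System.nanoTime() - t0;

            System.out.printf("raw=%d gzip=%d (%.1f ms) snappy=%d (%.1f ms)%n",
                    batch.length, gzipped.size(), gzipNs / 1e6,
                    snappied.length, snappyNs / 1e6);
        }
    }

Typically gzip yields the smaller output while Snappy compresses severalfold
faster, which is exactly the intermediate point described above.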
> Because of the style of implementation, each compression type does require
> support from both the producer and the consumers in each language. However,
> lacking a compression type in one language is not a big impediment. If a
> given language doesn't support it, users of that client can simply not use
> that compression type.
>
> My understanding is that Snappy is available as fairly portable C, so it
> should be reasonable to embed in most common languages.
>
> Does that sound reasonable?
>
> -Jay
>
> On Sat, Nov 12, 2011 at 11:24 AM, Jeffrey Damick <[EMAIL PROTECTED]> wrote:
>
> > Right, but on the other hand, if every compression under the sun is
> > allowed, then you could end up with a very fractured client community of
> > support.
> >
> > I guess I'd like to see a client RFC of sorts, but maybe I'm the only one
Later message in this thread (collapsed):
  Jun Rao  2011-11-14, 00:57