Re: compression performance
Sriram, I think I agree. Guozhang's proposal is clever but it exposes a lot
of complexity to the consumer. But I think it is good to have the complete

Chris, we will certainly not mess up the uncompressed case, don't worry. I
think your assumption is that compression needs to be slow. Where Sriram
and I are coming from is that if snappy can roundtrip at 400MB/core, CPU is
not going to be a bottleneck, and so this will be "free". We think the issue
you are seeing is really not due to compression so much as it is due to
silliness on our part. Previously that silliness was on the producer side,
where for us it was masked in 0.7 by the fact that we have like 10,000
producers, so the additional CPU wasn't super noticeable; obviously once you
centralize that down to a few dozen brokers the problem becomes quite acute.
Even Guozhang's proposal would only remove the recompression; the
decompression is still there.
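
For readers following along, the application-level approach Chris describes
in the quoted message (batch and compress in the producer app, then hand the
result to Kafka as one ordinary uncompressed message) might look roughly like
the sketch below. This is only an illustration under stated assumptions: the
function names pack_batch and unpack_batch are hypothetical, the
length-prefixed framing is made up for the example, and zlib from the Python
standard library stands in for snappy. It is not the Kafka producer API and
not Chris's actual code.

```python
import struct
import zlib


def pack_batch(messages):
    """Frame each message with a 4-byte big-endian length, then compress.

    Hypothetical helper: the result would be sent to Kafka as the payload
    of a single uncompressed message, so the broker never has to
    decompress or recompress it.
    """
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    # zlib is a stand-in for snappy here (assumption for the sketch).
    return zlib.compress(framed)


def unpack_batch(payload):
    """Reverse pack_batch on the consumer side: decompress, then split."""
    framed = zlib.decompress(payload)
    messages = []
    pos = 0
    while pos < len(framed):
        (size,) = struct.unpack_from(">I", framed, pos)
        pos += 4
        messages.append(framed[pos:pos + size])
        pos += size
    return messages
```

The tradeoff is the one discussed in this thread: the broker only sees an
opaque blob, so it cannot validate or assign offsets to the individual
messages inside it, and every consumer has to know the framing.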

On Thu, Aug 15, 2013 at 7:50 PM, Chris Hogue <[EMAIL PROTECTED]> wrote:

> I would generally agree with the key goals you've suggested.
> I'm just coming to this discussion after some recent testing with 0.8 so I
> may be missing some background. The reference I found to this discussion is
> the JIRA issue below. Please let me know if there are other things I
> should look at.
> https://issues.apache.org/jira/browse/KAFKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> If I'm reading this correctly the reasoning behind removing compression
> from the producer is that its benefit (network bandwidth saved) is
> outweighed by the cost of the un-compress/re-compress on the broker. The
> issue with the odd heuristic about which codec to use on the broker makes
> sense.
> However I think the implied assumption that the broker will always
> un-compress and re-compress warrants discussion. This doesn't necessarily
> have to be the case as the approach outlined in this thread suggests. And
> if you remove that assumption you free up a lot of potential in the
> brokers.
> While one way to look at this is "we're already doing it on the broker, why
> do it on the producer", we came to it from the other angle, "we're already
> doing it on the producer, why have the broker do it again". I can certainly
> see cases where each would be appropriate.
> As noted in other threads removing compression from the broker's
> responsibility increased our throughput over 3x. This is still doing
> compression on the producer app, just outside of the kafka API, so it still
> benefits from the reduced network bandwidth the current built-in
> compression has.
> I appreciate the gains that can be had through optimizing the byte
> management in that code path. That seems like a good path to go down for
> the general case. But no matter how much you optimize it there's still
> going to be a non-trivial cost on the broker.
> So in an ideal world the Kafka APIs would have a built-in ability for us to
> choose at an application level whether we want the compression load to be
> on the producer or the broker. At a minimum I'm really hoping our ability
> to do that ourselves doesn't go away, especially if we're willing to take
> on the responsibility of batching/compressing. Said another way, we would
> at least need the optimized code path for uncompressed messages in
> ByteBufferMessageSet.assignOffsets() to stick around so that we can do it
> on our own.
> Thanks for all of the consideration here, it's a good discussion.
> -Chris
> On Thu, Aug 15, 2013 at 2:23 PM, Sriram Subramanian <
> > We need to first decide on the right behavior before optimizing on the
> > implementation.
> >
> > A few key goals that I would put forward are:
> >
> > 1. Decoupling compression codec of the producer and the log
> > 2. Ensuring message validity by the server on receiving bytes. This is
> > done by the iterator today and this is important to ensure bad data does