Kafka >> mail # user >> Client improvement discussion


Jay Kreps 2013-07-26, 19:00
Jason Rosenberg 2013-07-26, 21:46
Xavier Stevens 2013-07-26, 22:41
Chris Riccomini 2013-07-28, 03:13
Jay Kreps 2013-07-29, 04:58
Sybrandy, Casey 2013-07-29, 13:03
Chris Hogue 2013-08-02, 19:29
Jay Kreps 2013-08-02, 19:50
Re: Client improvement discussion
Great comments, answers inline!

On Fri, Aug 2, 2013 at 12:28 PM, Chris Hogue <[EMAIL PROTECTED]> wrote:
Cool.

I think even in 0.7 there was only one thread, right?

Cool, yeah, currently you must use the simple consumer to get that, which is
a pain.

I'm not 100% sure, but I believe the compression can still be done inline.
The compression algorithm will buffer a bit, of course. What we currently
do, though, is write out the full data uncompressed and then compress it.
This is pretty inefficient. Basically we are using Java's OutputStream APIs
for compression, but we need to be using the lower-level, array-oriented
APIs (like Deflater). I haven't tried this, but my assumption is that
we can compress the messages as they arrive into the destination buffer
instead of the current approach.
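
Roughly what I have in mind, as a sketch (hypothetical class name, not
actual Kafka code), using Deflater's array-based API to compress each
message into a pre-allocated destination buffer as it arrives:

import java.util.zip.Deflater;

// Sketch only: compress messages into a destination buffer as they
// arrive, using Deflater's array-based API instead of wrapping an
// OutputStream around data that was already written uncompressed.
public class InlineCompressor {
    private final Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    private final byte[] dest;
    private int destOffset = 0;

    public InlineCompressor(int maxCompressedSize) {
        this.dest = new byte[maxCompressedSize];
    }

    // Feed one message to the compressor and drain whatever output is ready.
    public void append(byte[] message) {
        deflater.setInput(message);
        drain(false);
    }

    // Signal end of input, flush the rest, and return the compressed size.
    public int finish() {
        deflater.finish();
        drain(true);
        return destOffset;
    }

    private void drain(boolean finishing) {
        while ((finishing && !deflater.finished())
                || (!finishing && !deflater.needsInput())) {
            int remaining = dest.length - destOffset;
            if (remaining == 0) {
                throw new IllegalStateException("destination buffer is full");
            }
            destOffset += deflater.deflate(dest, destOffset, remaining);
        }
    }
}

The same shape should apply to Snappy's array-based calls.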

Yes, it is a bummer. We think ultimately this does make sense though, for
two reasons beyond offsets:
1. You have to validate the integrity of the data the client has sent to
you, or else one bad or buggy client can screw up all consumers.
2. The compression of the log should not be tied to the compression used by
individual producers. We haven't made this change yet, but it is an easy
one. The problem today is that if your producers send a variety of
compression types, your consumers need to handle the union of all types, and
you have no guarantee over what types producers may send in the future.
Instead we think these should be decoupled: the topic should have a
compression-type property, and that should be totally decoupled from the
compression type the producer uses. In many cases there is no real need for
the producer to use compression at all, since the real thing you want to
optimize is later inter-datacenter transfers, not the network send to the
local broker, so the producer can just send uncompressed and have the broker
control the compression type.
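
As a rough sketch of that decoupling (hypothetical names, not actual Kafka
code): the codec that lands in the log would be a per-topic setting, and the
producer's codec would only matter for the hop to the broker.

import java.util.HashMap;
import java.util.Map;

// Sketch only: the codec written to the log is a per-topic property,
// independent of whatever codec each producer happened to use.
enum Codec { NONE, GZIP, SNAPPY }

final class TopicCompressionPolicy {
    private final Map<String, Codec> topicCodec = new HashMap<String, Codec>();

    void setTopicCodec(String topic, Codec codec) {
        topicCodec.put(topic, codec);
    }

    // The broker re-codes incoming message sets into this codec before
    // appending, so consumers only ever have to handle one codec per topic.
    Codec codecForLog(String topic) {
        Codec codec = topicCodec.get(topic);
        return codec != null ? codec : Codec.NONE;
    }
}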

The performance problem really has two causes, though:
1. GZIP is super slow, especially Java's implementation. But Snappy, for
example, is actually quite fast. We should be able to do Snappy at network
speeds according to the perf data I have seen, but...
2. ...our current compression code is kind of inefficient because of all the
copying and traversal, for the reasons cited above.

So in other words I think we can make this a bit better but it probably
won't go away. How do you feel about snappy?
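
If you want a rough feel for the difference on your own payloads, something
like this works (assumes the snappy-java library on the classpath; purely
illustrative, not a rigorous benchmark):

import java.util.zip.Deflater;
import org.xerial.snappy.Snappy;

// Rough, single-shot comparison of DEFLATE (what GZIP uses) vs. Snappy.
// Real numbers depend heavily on the data, so treat this as illustrative.
public class CompressionComparison {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[1 << 20]; // replace with real message bytes

        long t0 = System.nanoTime();
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        // Headroom in case the data is incompressible and expands slightly.
        byte[] out = new byte[payload.length + 1024];
        int deflateSize = 0;
        while (!deflater.finished()) {
            deflateSize += deflater.deflate(out, deflateSize, out.length - deflateSize);
        }
        long t1 = System.nanoTime();

        byte[] snappyOut = Snappy.compress(payload);
        long t2 = System.nanoTime();

        System.out.printf("deflate: %d bytes in %.1f ms%n", deflateSize, (t1 - t0) / 1e6);
        System.out.printf("snappy:  %d bytes in %.1f ms%n", snappyOut.length, (t2 - t1) / 1e6);
    }
}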

We can't really do this because we are multi-writer, so any offset we gave
the client would potentially be used by another producer and then be
invalid or non-sequential.

 
Chris Hogue 2013-08-02, 23:56
Jay Kreps 2013-08-03, 02:42
Chris Hogue 2013-08-03, 13:50
Tommy Messbauer 2013-07-29, 15:28