Kafka, mail # dev - Client improvement discussion

Re: Client improvement discussion
Jason Rosenberg 2013-07-26, 21:46
Jay,

This seems like a great direction.  Simplifying the consumer client would
be a big win, and +1 for more native java client integration.

On the last point, regarding memory usage for buffering per partition: I
would think it could be possible to devise a dynamic queuing system, to
allow higher volume partitions to have larger effective buffers than
smaller, low-volume partitions.  Thus, if you reserve a fixed
total.buffer.memory, you could allocate units of buffer space which could
then be composed to make larger buffers (perhaps not necessarily
contiguous).  The long tail of low-volume partitions could also be moved to
some sort of auxiliary, non-collated buffer space, as they are less likely
to benefit from contiguous buffering anyway.
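
To make the idea a bit more concrete, here is a rough sketch of what I have in
mind. It is only an illustration: the class name ChunkedBufferPool, the
chunkSize parameter, and the per-partition String keys are all made up for this
example and are not part of the proposal or of any existing Kafka code. A fixed
total.buffer.memory is carved into equal chunks, and each partition borrows as
many chunks as its volume requires, so a busy partition ends up with a larger
(non-contiguous) buffer than a quiet one.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a fixed memory budget split into equal-sized chunks. */
final class ChunkedBufferPool {
    private final int chunkSize;
    private final Deque<ByteBuffer> free = new ArrayDeque<>();
    private final Map<String, List<ByteBuffer>> perPartition = new HashMap<>();

    ChunkedBufferPool(long totalBufferMemory, int chunkSize) {
        this.chunkSize = chunkSize;
        long chunks = totalBufferMemory / chunkSize;
        for (long i = 0; i < chunks; i++) {
            free.push(ByteBuffer.allocate(chunkSize));
        }
    }

    /**
     * Appends data to the partition's buffer, borrowing chunks as needed.
     * Returns false (and writes nothing) if the pool cannot cover the request.
     */
    synchronized boolean append(String partition, byte[] data) {
        List<ByteBuffer> chunks =
            perPartition.computeIfAbsent(partition, p -> new ArrayList<>());
        int tailRoom = chunks.isEmpty() ? 0 : chunks.get(chunks.size() - 1).remaining();
        int needed = Math.max(0, data.length - tailRoom);
        int newChunks = (needed + chunkSize - 1) / chunkSize;
        if (newChunks > free.size()) {
            return false;
        }
        int offset = 0;
        while (offset < data.length) {
            ByteBuffer tail = chunks.isEmpty() ? null : chunks.get(chunks.size() - 1);
            if (tail == null || !tail.hasRemaining()) {
                tail = free.pop();
                chunks.add(tail);
            }
            int n = Math.min(tail.remaining(), data.length - offset);
            tail.put(data, offset, n);
            offset += n;
        }
        return true;
    }

    /** Returns the partition's chunks to the pool (e.g. after a send). */
    synchronized int release(String partition) {
        List<ByteBuffer> chunks = perPartition.remove(partition);
        if (chunks == null) {
            return 0;
        }
        for (ByteBuffer b : chunks) {
            b.clear();
            free.push(b);
        }
        return chunks.size();
    }

    synchronized int freeChunks() {
        return free.size();
    }

    synchronized int chunksHeldBy(String partition) {
        List<ByteBuffer> chunks = perPartition.get(partition);
        return chunks == null ? 0 : chunks.size();
    }
}
```

With a 1024-byte budget and 256-byte chunks, a partition that appends 600 bytes
holds three chunks while one that appends 10 bytes holds one. The sketch does
not attempt the auxiliary space for the long tail; that would need a separate
shared buffer.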

Fun stuff.

Jason

On Fri, Jul 26, 2013 at 3:00 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> I sent around a wiki a few weeks back proposing a set of client
> improvements that essentially amount to a rewrite of the producer and
> consumer java clients.
>
> https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite
>
> The below discussion assumes you have read this wiki.
>
> I started to do a little prototyping for the producer and wanted to share
> some of the ideas that came up to get early feedback.
>
> First, a few simple but perhaps controversial things to discuss.
>
> Rollout
> Phase 1: We add the new clients. No change on the server. Old clients still
> exist. The new clients will be entirely in a new package so there will be
> no possibility of name collision.
> Phase 2: We swap out all shared code on the server to use the new client
> stuff. At this point the old clients still exist but are essentially
> deprecated.
> Phase 3: We remove the old client code.
>
> Java
> I think we should do the clients in java. Making our users deal with
> scala's compatibility issues and crazy stack traces causes people a lot
> of pain. Furthermore we end up having to wrap everything now to get a
> usable java api anyway for non-scala people. This does mean maintaining a
> substantial chunk of java code, which is maybe less fun than scala. But
> basically I think we should optimize for the end user and produce a
> standalone pure-java jar with no dependencies.
>
> Jars
> We definitely want to separate out the client jar. There is also a fair
> amount of code shared between client and server (exceptions, protocol definition, utils,
> and the message set implementation). Two approaches.
> Two jar approach: split kafka.jar into kafka-clients.jar and
> kafka-server.jar with the server depending on the clients. The advantage of
> this is that it is simple. The disadvantage is that things like utils and
> protocol definition will be in the client jar though technically they belong
> equally to the server.
> Many jar approach: split kafka.jar into kafka-common.jar,
> kafka-producer.jar, kafka-consumer.jar, kafka-admin.jar, and
> kafka-server.jar. The disadvantage of this is that the user needs two jars
> (common + something) which is for sure going to confuse people. I also
> think this will tend to spawn more jars over time.
>
> Background threads
> I am thinking of moving both serialization and compression out of the
> background send thread. I will explain a little about this idea below.
>
> Serialization
> I am not sure if we should handle serialization in the client at all.
> Basically I wonder if our own API wouldn't just be a lot simpler if we took
> a byte[] key and byte[] value and let people serialize stuff themselves.
> Injecting a class name for us to create the serializer is more roundabout
> and has a lot of problems if the serializer itself requires a lot of
> configuration or other objects to be instantiated.
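
For illustration, a byte[]-only API along these lines might look roughly like
the following. This is a sketch under assumptions, not the proposed client:
BytesProducer, RecordingProducer, and UserEvent are hypothetical names invented
for this example, and the in-memory producer only stands in for a real one so
the code runs without a broker. The point is that the caller owns
serialization, so no serializer class name has to be injected or configured.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface: the client only ever sees raw bytes. */
interface BytesProducer {
    void send(String topic, byte[] key, byte[] value);
}

/** In-memory stand-in that records what was sent (no network involved). */
final class RecordingProducer implements BytesProducer {
    final List<byte[][]> sent = new ArrayList<>();

    @Override
    public void send(String topic, byte[] key, byte[] value) {
        sent.add(new byte[][] { key, value });
    }
}

/** Example application type that knows how to serialize itself. */
final class UserEvent {
    final String userId;
    final String action;

    UserEvent(String userId, String action) {
        this.userId = userId;
        this.action = action;
    }

    byte[] keyBytes() {
        return userId.getBytes(StandardCharsets.UTF_8);
    }

    byte[] valueBytes() {
        return (userId + "," + action).getBytes(StandardCharsets.UTF_8);
    }
}
```

A caller would then write something like
producer.send("events", event.keyBytes(), event.valueBytes()), with any
serializer configuration living entirely in application code.
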
>
> Partitioning
> The real question with serialization is whether the partitioning should
> happen on the java object or on the byte array key. The argument for doing
> it on the java object is that it is easier to do something like a range
> partition on the object. The problem with doing it on the object is that