Kafka, mail # dev - Re: New Producer Public API

Re: New Producer Public API
David Arthur 2014-01-31, 14:02

On 1/30/14 8:18 PM, Joel Koshy wrote:
> That's a good point about 1A - does seem that we would need to have
> some kind of TTL for each topic's metadata.
> Also, WRT the ZK dependency, I don't think that decision (for the Java
> client) affects other clients, i.e., other client implementations can
> use whatever discovery mechanism they choose. That said, I prefer not
> having a ZK dependency for the same reasons covered earlier in this
> thread.
FWIW, I think including ZK for broker discovery is a nice feature. Users
of kafka-python are constantly asking for something like this. If client
dependencies are a concern, then we could abstract the bootstrap
strategy into a simple pluggable interface so we could publish a
separate artifact. I could also imagine some AWS-specific bootstrap
strategy (e.g., get hosts from a particular security group, load
balancer/auto-scaling group, etc).
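
To make the pluggable idea concrete, here is a minimal sketch in Java of what
such an interface might look like. Everything in it is an assumption for
illustration: BootstrapStrategy, discoverBrokers() and StaticListBootstrap are
hypothetical names, not part of any existing Kafka client API. The core client
would ship only the static-list default, and a ZK-backed or AWS-backed
strategy could live in a separate artifact that implements the same interface.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin point: returns the brokers to contact for initial metadata.
// Not an existing Kafka interface; the name and signature are assumptions.
interface BootstrapStrategy {
    List<InetSocketAddress> discoverBrokers() throws Exception;
}

// Default strategy with no external dependencies: parse "host1:port1,host2:port2".
final class StaticListBootstrap implements BootstrapStrategy {
    private final String brokerList;

    StaticListBootstrap(String brokerList) {
        this.brokerList = brokerList;
    }

    @Override
    public List<InetSocketAddress> discoverBrokers() {
        List<InetSocketAddress> brokers = new ArrayList<>();
        for (String entry : brokerList.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray or trailing commas
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected host:port, got: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // Unresolved on purpose: DNS lookup is left to the connection layer.
            brokers.add(InetSocketAddress.createUnresolved(host, port));
        }
        return brokers;
    }
}
```

A ZK-based implementation would read the broker registrations from ZooKeeper
inside discoverBrokers() and return them in the same form, so the producer
itself never needs to know where the list came from.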

Or, we could just include ZK.

> On Thu, Jan 30, 2014 at 4:34 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>> With option 1A, if we increase # partitions on a topic, how will the
>> producer find out about newly created partitions? Do we expect the
>> producer to periodically call getCluster()?
>> As for the ZK dependency, one of the goals of the client rewrite is to
>> reduce dependencies so that one can implement the client in languages
>> other than Java. A ZK client is only available in a small number of
>> languages.
>> Thanks,
>> Jun
>> On Fri, Jan 24, 2014 at 5:17 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
>>> Clark and all,
>>> I thought a little bit about the serialization question. Here are the
>>> options I see and the pros and cons I can think of. I'd love to hear
>>> people's preferences if you have a strong one.
>>> One important consideration is that however the producer works will also
>>> need to be how the new consumer works (which we hope to write next). That
>>> is, if you put objects in, you should get objects out. So we need to think
>>> through both sides.
>>> Options:
>>> Option 0: What is in the existing Scala code and the Java code I
>>> posted: Serializer and Partitioner plugins provided by the user via config.
>>> Partitioner has a sane default, but Serializer needs to be specified in
>>> config.
>>> Pros: How it works today in the Scala code.
>>> Cons: You have to adapt your serialization library of choice to our
>>> interfaces. The reflective class loading means a typo in the serializer
>>> name gives odd errors. Likewise there is little type safety: the
>>> ProducerRecord takes Object, and any type mismatch between the object
>>> provided and the serializer surfaces only at runtime.
>>> Option 1: No plugins
>>> This would mean byte[] key, byte[] value, and partitioning done by client
>>> by passing in a partition *number* directly.
>>> The problem with this is that it is tricky to compute the partition
>>> correctly and probably most people won't. We could add a getCluster()
>>> method to return the Cluster instance you should use for partitioning. But
>>> I suspect people would be lazy and not use that, and would instead
>>> hard-code partitions, which would break if partitions were added or if
>>> they hard-coded them wrong. In my experience 3 partitioning strategies
>>> cover about 99% of cases, so not having a default implementation for this
>>> makes the common case harder. Left to their own devices, people will use
>>> bad hash functions and get weird results.
>>> Option 1A: Alternatively, we could partition by the key using the existing
>>> default partitioning strategy, which only uses the byte[] anyway. Instead
>>> of having a partitionKey, we could have a numerical partition override and
>>> add the getCluster() method to get the cluster metadata. That would make
>>> custom partitioning possible but handle the common case simply.
>>> Option 2: Partitioner plugin remains, serializers go.
>>> The problem here is that the partitioner might lose access to the
>>> deserialized key which would occasionally be useful for semantic