Re: New Producer Public API
Jay Kreps 2014-01-31, 00:14
One downside to the 1A proposal is that without a Partitioner interface we
can't really package up and provide common partitioner implementations.
Example of these would be
1. HashPartitioner - The default hash partitioning
2. RoundRobinPartitioner - Just round-robins over partitions
3. ConnectionMinimizingPartitioner - Choose partitions to minimize the
number of nodes you need to maintain TCP connections to.
4. RangePartitioner - User provides break points that map key ranges to partitions
5. LocalityPartitioner - Prefer nodes on the same rack. This would be nice
for stream-processing use cases that read from one topic and write to
another. We would have to include rack information in our metadata.
Having this kind of functionality included is actually kind of nice.
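To make the idea concrete, here is a minimal sketch of what a bundled partitioner could look like. The interface shape below is just an illustration of the plugin concept, not the actual proposed API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shape of the Partitioner plugin interface under discussion;
// the real signature may differ (e.g. it might take cluster metadata too).
interface Partitioner {
    int partition(byte[] key, int numPartitions);
}

// Round-robins over partitions, ignoring the key entirely.
public class RoundRobinPartitioner implements Partitioner {
    private final AtomicInteger counter = new AtomicInteger(0);

    @Override
    public int partition(byte[] key, int numPartitions) {
        // Mask off the sign bit so the index stays non-negative after overflow.
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }
}
```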
On Fri, Jan 24, 2014 at 5:17 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> Clark and all,
> I thought a little bit about the serialization question. Here are the
> options I see and the pros and cons I can think of. I'd love to hear
> people's preferences if you have a strong one.
> One important consideration is that however the producer works will also
> need to be how the new consumer works (which we hope to write next). That
> is if you put objects in, you should get objects out. So we need to think
> through both sides.
> Option 0: What is in the existing scala code and the java code I
> posted--Serializer and Partitioner plugin provided by the user via config.
> Partitioner has a sane default, but Serializer needs to be specified in
> config.
> Pros: How it works today in the scala code.
> Cons: You have to adapt your serialization library of choice to our
> interfaces. The reflective class loading means a typo in the serializer name
> gives odd errors. Likewise there is little type safety--the ProducerRecord
> takes Object, so any type mismatch between the object provided and the
> serializer only surfaces at runtime.
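For reference, the plugin style in option 0 amounts to an interface along these lines (a sketch of the idea, not the exact signature in the posted code):

```java
// Sketch of a serializer plugin in the option-0 style: the user names an
// implementing class in config and it is loaded reflectively. Illustrative only.
interface Serializer<T> {
    byte[] serialize(T value);
}

public class StringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String value) {
        return value.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

The type-safety con above shows up here: the producer only sees the configured class at runtime, so handing it a record whose key isn't a String fails with a ClassCastException rather than a compile error.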
> Option 1: No plugins
> This would mean byte[] key, byte[] value, and partitioning done by the
> client by passing in a partition *number* directly.
> The problem with this is that it is tricky to compute the partition
> correctly and probably most people won't. We could add a getCluster()
> method to return the Cluster instance you should use for partitioning. But
> I suspect people would be lazy and not use that and instead hard-code
> partitions, which would break if partitions were added or if they hard-coded
> them wrong. In my experience 3 partitioning strategies cover like 99% of cases
> so not having a default implementation for this makes the common case
> harder. Left to their own devices people will use bad hash functions and
> get weird results.
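To make the "bad hash functions" point concrete, here is the kind of code a user would end up writing under option 1 (names are mine, not proposed API). The naive version is a classic Java mistake: `%` keeps the sign of the dividend, so a negative hash yields a negative "partition":

```java
public class ClientSidePartitioning {
    // What users are likely to write left to their own devices. BUG: Java's %
    // keeps the sign of the dividend, so a negative hash gives a negative index.
    static int naivePartition(int keyHash, int numPartitions) {
        return keyHash % numPartitions;
    }

    // A correct version masks off the sign bit before taking the modulo.
    static int safePartition(int keyHash, int numPartitions) {
        return (keyHash & 0x7fffffff) % numPartitions;
    }
}
```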
> Option 1A: Alternatively we could partition by the key using the existing
> default partitioning strategy (which only uses the key bytes anyway) but instead
> of having a partitionKey we could have a numerical partition override and
> add the getCluster() method to get the cluster metadata. That would make
> custom partitioning possible but handle the common case simply.
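A rough sketch of how 1A could behave, with illustrative names (not a proposed API): the override is a nullable Integer, and when it is absent the producer falls back to hashing the key bytes itself:

```java
import java.util.Arrays;

public class Option1A {
    // Hypothetical helper: honor an explicit partition override if given,
    // otherwise hash the serialized key with a built-in default strategy.
    static int choosePartition(byte[] keyBytes, Integer partitionOverride,
                               int numPartitions) {
        if (partitionOverride != null)
            return partitionOverride; // computed by the user, e.g. via getCluster()
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }
}
```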
> Option 2: Partitioner plugin remains, serializers go.
> The problem here is that the partitioner might lose access to the
> deserialized key which would occasionally be useful for semantic
> partitioning schemes. The Partitioner could deserialize the key but that
> would be inefficient and weird.
> This problem could be fixed by having key and value be byte[] but
> retaining partitionKey as an Object and passing it to the partitioner as
> is. Then if you have a partitioner which requires the deserialized key you
> would need to use this partition key. One weird side effect is that if you
> want to have a custom partition key BUT want to partition by the bytes of
> that key rather than the object value you must write a custom partitioner
> and serialize it yourself.
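A sketch of that fix, with illustrative names (not the real API): key and value are byte[], while partitionKey is handed to the partitioner untouched, so a semantic partitioner can use the deserialized key directly:

```java
// Hypothetical semantic partitioner for option 2: keyBytes are available but
// ignored; the deserialized partitionKey (assumed here to be a Long user id)
// drives the partition choice.
public class UserIdPartitioner {
    public int partition(byte[] keyBytes, Object partitionKey, int numPartitions) {
        long userId = (Long) partitionKey;
        // floorMod keeps the result in [0, numPartitions) even for negative ids.
        return (int) Math.floorMod(userId, (long) numPartitions);
    }
}
```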
> Of these I think I prefer 1A but could be convinced of 0 since that is how
> it works now.