Re: Random Partitioning Issue
Joe,

Thanks for bringing this up. I want to clarify this a bit.

1. Currently, the producer-side logic is that if the partitioning key is
not provided (i.e., it is null), the partitioner won't be called. We did
that because we want to select a random, "available" partition to send
messages to, so that if some partitions are temporarily unavailable (because
of broker failures), messages can still be sent to other partitions. Doing
this in the partitioner is difficult since the partitioner doesn't know
which partitions are currently available (the DefaultEventHandler does).
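
To make that concrete, here is a rough sketch of the decision described
above. The types are simplified stand-ins invented just for illustration;
the real logic lives in the (Scala) DefaultEventHandler:

    import java.util.List;
    import java.util.Random;
    import java.util.stream.Collectors;

    // Simplified stand-in for the configured partitioner.
    interface Partitioner {
        int partition(Object key, int numPartitions);
    }

    // Simplified stand-in for per-partition metadata.
    class PartitionInfo {
        final int partitionId;
        final boolean hasLeader; // the partition is currently available
        PartitionInfo(int partitionId, boolean hasLeader) {
            this.partitionId = partitionId;
            this.hasLeader = hasLeader;
        }
    }

    class PartitionChooser {
        private final Random random = new Random();

        int choose(Object key, List<PartitionInfo> partitions, Partitioner partitioner) {
            if (key == null) {
                // No key: bypass the partitioner and pick a random partition
                // that currently has a leader, i.e., is available.
                List<PartitionInfo> available = partitions.stream()
                        .filter(p -> p.hasLeader)
                        .collect(Collectors.toList());
                return available.get(random.nextInt(available.size())).partitionId;
            }
            // A key was provided: delegate to the configured partitioner.
            return partitioner.partition(key, partitions.size());
        }
    }

With the KAFKA-1017 change, the random choice made in the null-key branch is
then reused until the next topic metadata refresh instead of being redrawn
on every send, which is what point 2 below is about.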

2. As Joel said, the common use case in production is that there are many
more producers than #partitions in a topic. In this case, sticking to a
partition for a few minutes is not going to cause too much imbalance in the
partitions and has the benefit of reducing the # of socket connections. My
feeling is that this will benefit most production users. In fact, if one
uses a hardware load balancer for producing data in 0.7, it behaves in
exactly the same way (a producer will stick to a broker until the reconnect
interval is reached).
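
For reference, the "few minutes" above is the producer's topic metadata
refresh interval (topic.metadata.refresh.interval.ms, 10 minutes by
default). A minimal 0.8 producer sketch showing where that knob lives; the
broker list and topic name are made up for illustration:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class NullKeyProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // How long a producer sticks to the randomly chosen partition for
            // null-keyed messages before a metadata refresh lets it pick a
            // new one.
            props.put("topic.metadata.refresh.interval.ms", "600000");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            // No key is supplied, so the partitioner is bypassed and a random
            // available partition is chosen, as described in point 1.
            producer.send(new KeyedMessage<String, String>("my-topic",
                "a message with no key"));
            producer.close();
        }
    }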

3. It is true that if one is testing a topic with more than one partition
(the default is a single partition), this behavior can be a bit weird.
However, I think it can be mitigated by running multiple test producer
instances.
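
For instance, something like the following (again, the broker and topic
names are made up) spreads test data over a multi-partition topic, since
each producer instance independently picks its own random partition for
null-keyed messages and sticks to it:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class MultiInstanceTestProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            ProducerConfig config = new ProducerConfig(props);

            // Several independent producer instances, each with its own
            // randomly chosen partition for null-keyed messages.
            for (int i = 0; i < 4; i++) {
                Producer<String, String> producer =
                    new Producer<String, String>(config);
                producer.send(new KeyedMessage<String, String>("test-topic",
                    "message from instance " + i));
                producer.close();
            }
        }
    }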

4. Someone reported on the mailing list that all data shows up in only one
partition after a few weeks. This is clearly not the expected behavior. We
can take a closer look to see if this is a real issue.

Do you think these address your concerns?

Thanks,

Jun

On Sat, Sep 14, 2013 at 11:18 AM, Joe Stein <[EMAIL PROTECTED]> wrote:

> How about creating a new class called RandomRefreshPartitioner, copying the
> DefaultPartitioner code to it, and then reverting the DefaultPartitioner
> code?  I appreciate this is a one-time burden for folks using the existing
> 0.8-beta1 who bump into KAFKA-1017 in production and have to switch to the
> RandomRefreshPartitioner, and when folks deploy to production they will
> have to consider this property change.
>
> I make this suggestion keeping in mind the new folks that come on board with
> Kafka: when everyone is in development and testing mode for the first time,
> their experience this way would match how it would work in production.  In
> dev/test, when first using Kafka, they won't have so many producers per
> partition but would look to parallelize their consumers, IMHO.
>
> The random broker change sounds like maybe a bigger change now, this late
> in the release cycle, if we can accommodate folks trying Kafka for the first
> time, through their development and testing, along with full-blown
> production deploys.
>
> /*******************************************
>  Joe Stein
>  Founder, Principal Consultant
>  Big Data Open Source Security LLC
>  http://www.stealth.ly
>  Twitter: @allthingshadoop
> ********************************************/
>
>
> On Sep 14, 2013, at 8:17 AM, Joel Koshy <[EMAIL PROTECTED]> wrote:
>
> >>
> >>
> >> Thanks for bringing this up - it is definitely an important point
> >> to discuss. The underlying issue of KAFKA-1017 was uncovered to
> >> some degree by the fact that in our deployment we did not
> >> significantly increase the total number of partitions over 0.7 -
> >> i.e., in 0.7 we had, say, four partitions per broker; now we are
> >> using (say) eight partitions across the cluster. So with random
> >> partitioning every producer would end up connecting to nearly every
> >> broker (unlike 0.7, in which we would connect to only one broker
> >> within each reconnect interval). In a production-scale deployment
> >> that causes the high number of connections that KAFKA-1017
> >> addresses.
> >>
> >> You are right that the fix of sticking to one partition over the
> >> metadata refresh interval goes against true consumer parallelism,
> >> but this would be the case only if there are few producers. If you
> >> have a sizable number of producers, on average all partitions would
> >> get uniform volumes of data.