Re: Is 30 a too high partition number?
Philip O'Toole 2013-10-08, 16:35
I would like to second that. It would be really useful.
On Oct 8, 2013, at 9:31 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
> What I would like to see is a way for inactive topics to automatically get
> removed after they are inactive for a period of time. That might help in
> this case.
> I added a comment to this larger jira:
> Perhaps it should really be its own jira entry.
> On Tue, Oct 8, 2013 at 10:29 AM, Aniket Bhatnagar <
> [EMAIL PROTECTED]> wrote:
>> Thanks Neha. Is it worthwhile to investigate an option to store topic
>> metadata (partitions, etc) into another consistent data store (MySQL,
>> HBase, etc)? Should we make this feature pluggable?
>> The reason I am thinking we may need to surpass the 2000 total partition
>> limit is because there may be genuine use cases to have a high number of
>> topics. For example, in my particular case, I am using Kafka as a buffer to
>> store data arriving from various sensors deployed in physical world. These
>> sensors may be short lived or may be long lived. I was thinking of having
>> individual topics for each sensor. This way, if a badly behaving sensor
>> attempts to push data at a much faster rate than we can process as a
>> Kafka consumer, we will eventually overflow and start losing data for that
>> particular sensor. However, we can still potentially continue to process
>> data from other sensors that are pushing data at a manageable rate. If I go
>> with 1 topic for all the sensors, 1 misbehaving sensor can potentially lead
>> to us not catching up with the topic in the retention period, thus making us
>> lose data from all sensors.
>> The other issue is that if we go with a topic per sensor and the sensors
>> are short lived and we have reached a threshold of 2000 sensors already
>> deployed, Kafka will stop working (because of the Zookeeper limitation) even
>> though the previously deployed sensors may not be active at all.
>> I am sure there may be other genuine use cases for having many more than
>> 2000 topics.
>> On 4 October 2013 19:04, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>> You probably want to think of this in terms of number of partitions on a
>>> single broker, instead of per topic since I/O is the limiting factor in
>>> this case. Another factor to consider is the total number of partitions in
>>> the cluster, as Zookeeper becomes a limiting factor there. 30 partitions is
>>> not too large provided the total number of partitions doesn't exceed roughly
>>> a couple thousand. To give you an example, some of our clusters are 16
>>> big and some of the topics on those clusters have 30 partitions.
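A rough sketch of the arithmetic behind the cluster-wide partition budget discussed above. The figure of 2000 is only the approximate ceiling mentioned in this thread, not a hard Kafka constant, and the function names are hypothetical helpers introduced for illustration.

```python
# Assumption: ~2000 total partitions is the rough cluster-wide ceiling
# quoted in this thread; it is not a value enforced by Kafka itself.
CLUSTER_PARTITION_BUDGET = 2000


def total_partitions(num_topics: int, partitions_per_topic: int) -> int:
    """Total partitions the cluster has to track for uniformly sized topics."""
    return num_topics * partitions_per_topic


def max_topics(partitions_per_topic: int,
               budget: int = CLUSTER_PARTITION_BUDGET) -> int:
    """Largest number of topics that stays within the partition budget."""
    if partitions_per_topic <= 0:
        raise ValueError("partitions_per_topic must be positive")
    return budget // partitions_per_topic


if __name__ == "__main__":
    # With num.partitions = 30, a topic-per-sensor design uses up the
    # budget after a few dozen sensors.
    print(max_topics(30))
    # With the default of 2 partitions per topic the same budget
    # stretches much further.
    print(max_topics(2))
```

Under these assumptions the per-topic partition count matters less than how many topics multiply it, which is why one topic per sensor runs into the limit quickly.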
>>> On Oct 4, 2013 4:15 AM, "Aniket Bhatnagar" <[EMAIL PROTECTED]> wrote:
>>>> I am using kafka as a buffer for data streaming in from various sources.
>>>> Since it's time series data, I generate the key to the message by
>>>> combining source ID and minute in the timestamp. This means I can
>>>> have at most 60 partitions per topic (as each source has its own topic). I
>>>> have set num.partitions to be 30 (60/2) for each topic in the broker config.
>>>> I don't have a very good reason to pick 30 as the default number of partitions per
>>>> topic but I wanted it to be a high number so that I can achieve high
>>>> parallelism during in-stream processing. I am worried that a
>>>> number like 30 (default configuration had it as 2) can negatively
>>>> impact kafka performance in terms of message throughput or memory
>>>> consumption. I understand that this can lead to many files per
>>>> but I am thinking of dealing with it by having multiple directories on
>>>> the same disk if I run into issues at all.
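The keying scheme described above (source ID combined with the minute of the timestamp) can be sketched roughly as follows. The key format, the function names, and the CRC32-based partition mapping are all assumptions made for illustration. In particular, `partition_for` is a hypothetical stand-in and not the partitioner Kafka actually uses.

```python
import zlib
from datetime import datetime


def message_key(source_id: str, ts: datetime) -> str:
    """Build a key from the source ID and the minute-of-hour of the timestamp.

    The "id:MM" format is an assumption; the thread only says the two
    values are combined.
    """
    return f"{source_id}:{ts.minute:02d}"


def partition_for(key: str, num_partitions: int = 30) -> int:
    """Map a key to a partition index.

    Hypothetical stand-in for a producer-side partitioner: CRC32 is used
    here only because it is deterministic across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # One source produces at most 60 distinct keys (minutes 0..59), so
    # more than 60 partitions per topic could never all receive data.
    keys = {message_key("sensor-1", datetime(2013, 10, 4, 4, m)) for m in range(60)}
    print(len(keys))
    print(sorted({partition_for(k) for k in keys}))
```

With 60 possible keys and 30 partitions, each partition would receive about two keys on average, although a hash-based mapping is not guaranteed to spread them evenly.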
>>>> My question to the community is: am I prematurely attempting to
>>>> optimize the partition number, as right now even a partition number