Kafka >> mail # user >> Is 30 a too high partition number?


Aniket Bhatnagar 2013-10-04, 09:15
Neha Narkhede 2013-10-04, 13:35
Aniket Bhatnagar 2013-10-08, 14:29
Jason Rosenberg 2013-10-08, 16:31
Re: Is 30 a too high partition number?
I would like to second that. It would be really useful.

Philip

On Oct 8, 2013, at 9:31 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:

> What I would like to see is a way for inactive topics to automatically get
> removed after they are inactive for a period of time.  That might help in
> this case.
>
> I added a comment to this larger jira:
> https://issues.apache.org/jira/browse/KAFKA-330
>
> Perhaps it should really be its own jira entry.
>
> Jason
>
>
> On Tue, Oct 8, 2013 at 10:29 AM, Aniket Bhatnagar <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks Neha. Is it worthwhile to investigate an option to store topic
>> metadata (partitions, etc) into another consistent data store (MySQL,
>> HBase, etc)? Should we make this feature pluggable?
>>
>> The reason I think we may need to surpass the 2000 total partition
>> limit is that there may be genuine use cases for a high number of
>> topics. For example, in my particular case, I am using Kafka as a buffer to
>> store data arriving from various sensors deployed in the physical world. These
>> sensors may be short-lived or long-lived. I was thinking of having
>> an individual topic for each sensor. This way, if a badly behaving sensor
>> attempts to push data at a much faster rate than we can process as a
>> Kafka consumer, we will eventually overflow and start losing data for that
>> particular sensor. However, we can still potentially continue to process
>> data from other sensors that are pushing data at a manageable rate. If I go
>> with 1 topic for all the sensors, 1 misbehaving sensor can potentially keep
>> us from catching up with the topic within the retention period, thus making us
>> lose data from all sensors.
>>
>> The other issue is that if we go with a topic per sensor, the sensors
>> are short-lived, and we have reached a threshold of 2000 sensors already
>> deployed, Kafka will stop working (because of the Zookeeper limitation) even
>> though the previously deployed sensors may not be active at all.
>>
>> I am sure there may be other genuine use cases for having many more than
>> 2000 topics.
>>
>>
>> On 4 October 2013 19:04, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>
>>> You probably want to think of this in terms of the number of partitions on a
>>> single broker, instead of per topic, since I/O is the limiting factor in
>>> this case. Another factor to consider is the total number of partitions in the
>>> cluster, as Zookeeper becomes a limiting factor there. 30 partitions is not
>>> too large provided the total number of partitions doesn't exceed roughly
>>> a couple thousand. To give you an example, some of our clusters are 16 nodes
>>> big and some of the topics on those clusters have 30 partitions.
>>>
>>> Thanks,
>>> Neha
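
To make the partition budget above concrete, here is a rough arithmetic sketch. The 2000 figure is the approximate "couple thousand" ceiling quoted in this thread, not a hard Kafka limit. The function names and the `replication_factor` parameter are hypothetical, introduced only for illustration; the thread does not say whether replicas count toward the total.

```python
import math

# Approximate cluster-wide ceiling quoted in the thread ("roughly a couple
# thousand"); treat it as a rule of thumb, not a documented limit.
CLUSTER_PARTITION_BUDGET = 2000


def total_partitions(topics: int, partitions_per_topic: int) -> int:
    """Total partitions created by `topics` topics of equal size."""
    return topics * partitions_per_topic


def max_topics(budget: int, partitions_per_topic: int) -> int:
    """How many equally sized topics fit within a partition budget."""
    return budget // partitions_per_topic


def partitions_per_broker(total: int, brokers: int, replication_factor: int = 1) -> int:
    """Average partition replicas hosted per broker, assuming an even spread.

    replication_factor is an assumption added for illustration.
    """
    return math.ceil(total * replication_factor / brokers)


print(max_topics(CLUSTER_PARTITION_BUDGET, 30))   # 66
print(max_topics(CLUSTER_PARTITION_BUDGET, 2))    # 1000
print(total_partitions(2000, 30))                 # 60000
print(partitions_per_broker(2000, 16))            # 125
```

Under these assumptions, a topic per sensor with 30 partitions each exhausts the budget at about 66 sensors, while the former default of 2 partitions allows about 1000.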
>>> On Oct 4, 2013 4:15 AM, "Aniket Bhatnagar" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> I am using Kafka as a buffer for data streaming in from various sources.
>>>> Since it's time series data, I generate the key for each message by
>>>> combining the source ID and the minute in the timestamp. This means I can have
>>>> at most 60 partitions per topic (as each source has its own topic). I have
>>>> set num.partitions to 30 (60/2) for each topic in the broker config. I don't
>>>> have a very good reason to pick 30 as the default number of partitions per
>>>> topic, but I wanted it to be a high number so that I can achieve high
>>>> parallelism during in-stream processing. I am worried that having a high
>>>> number like 30 (the default configuration had it as 2) can negatively
>>>> impact Kafka performance in terms of message throughput or memory
>>>> consumption. I understand that this can lead to many files per partition,
>>>> but I am thinking of dealing with it by having multiple directories on the
>>>> same disk if I run into issues at all.
>>>>
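
The keying scheme described above can be sketched as follows. This is only an illustration under stated assumptions: "minute in the timestamp" is read as minute-of-hour (0 to 59), the `source:minute` key format is made up, and CRC32 is a stand-in hash, not the partitioner Kafka actually uses. `message_key` and `partition_for` are hypothetical helper names.

```python
import zlib


def message_key(source_id: str, epoch_seconds: int) -> str:
    """Combine the source ID with the minute-of-hour of the timestamp."""
    minute = (epoch_seconds // 60) % 60
    return f"{source_id}:{minute:02d}"


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stand-in hash (not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Three hours of readings from one source, one reading every 15 seconds.
keys = {message_key("sensor-42", t) for t in range(0, 3 * 3600, 15)}
used = {partition_for(k, 30) for k in keys}

print(len(keys))   # 60 distinct keys, one per minute-of-hour
print(len(used))   # number of the 30 partitions that receive data
```

With one topic per source there are only 60 distinct keys, so no more than 60 partitions can ever receive data, which is where the upper bound in the message comes from. How evenly those keys spread over 30 partitions depends on the hash in use.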
>>>> My question to the community is: am I prematurely attempting to
>>>> optimize the partition number, as right now even a partition number