I am using kafka as a buffer for data streaming in from various sources.
Since its a time series data, I generate the key to the message by
combining source ID and minute in the timestamp. This means I can utmost
have 60 partitions per topic (as each source has its own topic). I have
set num.partitions to be 30 (60/2) for each topic in broker config. I don't
have a very good reason to pick 30 as default number of partitions per
topic but I wanted it to be a high number so that I can achieve high
parallelism during in-stream processing. I am worried that having a high
number like 30 (default configuration had it as 2), it can negatively
impact kafka performance in terms of message throughput or memory
consumption. I understand that this can lead to many files per partition
but I am thinking of dealing with it by having multiple directories on the
same disk if at all I run into issues.
My question to the community is that am I prematurely attempting to
optimizing the partition number as right now even a partition number of 5
seems sufficient and hence will run into unwanted issues? Or is 30 an Ok
number to use for number of partitions?