Kafka, mail # user - implications of using large number of topics....


Jason Rosenberg 2012-10-10, 22:57
Neha Narkhede 2012-10-10, 23:05
Jason Rosenberg 2012-10-10, 23:12
Jay Kreps 2012-10-10, 23:25
Taylor Gautier 2012-10-11, 03:13
Mathias Söderberg 2012-10-11, 13:43
Jun Rao 2012-10-12, 05:48
Re: implications of using large number of topics....
Jason Rosenberg 2012-10-12, 17:55
Has any thought been given to handling a large number of topics better?
Any prior discussions?  Or would it likely be too great a change to the way
Kafka works, no matter what?

I'm wondering if there's a way to have a notion of multiple "virtual"
topics that are managed internally as members of a single topic "group"
but that, at the API level, appear to be unique topics from the client's
perspective.

Naturally, it would be straightforward to implement something like this by
wrapping the current client APIs, but I'm wondering if there's any benefit
to building it into the internals.  This would still have the downside that
a client subscribing to a virtual topic would have to, under the covers,
sift through lots of messages it's not interested in.
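
The wrapper approach described above can be sketched roughly as follows. This is a minimal illustration, not Kafka code: the in-memory `broker` dict stands in for real producer and consumer calls, and `NUM_GROUPS`, `group_topic_for`, `wrap`, `publish`, and `consume` are hypothetical names made up for the example. It also shows the downside mentioned: a consumer of one virtual topic reads every message in the shared group topic and discards the ones it does not want.

```python
import json
import zlib

NUM_GROUPS = 4  # hypothetical number of physical "group" topics

# In-memory stand-in for the broker: physical topic name -> list of raw messages.
broker = {}


def group_topic_for(virtual_topic):
    """Map a virtual topic name onto one of a fixed set of physical topics."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    index = zlib.crc32(virtual_topic.encode("utf-8")) % NUM_GROUPS
    return "group-%d" % index


def wrap(virtual_topic, payload):
    """Tag the payload with its virtual topic so consumers can filter on it."""
    envelope = {"vt": virtual_topic, "body": payload}
    return json.dumps(envelope).encode("utf-8")


def publish(virtual_topic, payload):
    """Append the tagged message to the physical topic backing the virtual one."""
    physical = group_topic_for(virtual_topic)
    broker.setdefault(physical, []).append(wrap(virtual_topic, payload))


def consume(virtual_topic):
    """Read the whole physical topic and keep only the wanted virtual topic."""
    wanted = []
    for raw in broker.get(group_topic_for(virtual_topic), []):
        envelope = json.loads(raw.decode("utf-8"))
        if envelope["vt"] == virtual_topic:  # everything else is sifted out
            wanted.append(envelope["body"])
    return wanted
```

With this layout the broker only ever sees `NUM_GROUPS` topics, however many virtual topics the clients use; the cost moves to the consumers, which pay for the filtering.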

Any other interesting approaches?

Jason
On Thu, Oct 11, 2012 at 10:48 PM, Jun Rao <[EMAIL PROTECTED]> wrote:

> Mathias,
>
> What matters is the total number of partitions, since each corresponds to
> a separate directory on disk. It doesn't matter how many topics those
> partitions are from.
>
> Thanks,
>
> Jun
>
> On Thu, Oct 11, 2012 at 6:43 AM, Mathias Söderberg <
> [EMAIL PROTECTED]> wrote:
>
> > Hey all,
> >
> > This is quite an interesting topic (no pun intended), and I've seen it
> > come up at least once before.
> >
> > A friend and I started experimenting with Kafka and ZooKeeper a little
> > while ago (building a publisher / subscriber system with consistent
> > hashing and whatnot), and currently we're using around 300 topics, all
> > with one partition each. So far we haven't done any serious performance
> > testing, but I'm planning to do so in the coming weeks. I've got a few
> > questions regardless:
> >
> >
> > Does / should it make any difference in performance when one has a lot of
> > topics compared to having one topic with a lot of partitions? I'm
> > imagining that the system still needs to keep the same number of file
> > descriptors open, but I'm not sure how this would affect reads and
> > writes. Are we going to run into more random reads and writes by using a
> > lot of topics rather than one topic with a lot of partitions? I can't
> > really wrap my head around this right now, mostly because of my rather
> > limited knowledge of how disks and page caches work.
> >
> > Could also add that we're mostly doing sequential reads (in rare cases we
> > have to rewind a topic) and that the number of topics doesn't change.
> >
> > On 11 October 2012 05:13, Taylor Gautier <[EMAIL PROTECTED]> wrote:
> >
> > > We used pattern #1 at Tagged.  I wouldn't recommend it unless you're
> > > really committed.  It took a lot of work to get it working right.
> > >
> > > a) Performance degraded non-linearly (read: it fell off a cliff) when
> > > brokers were managing more than about 20k topics.  This was on a Linux
> > > RHEL 5.3 system with EXT3.  YMMV.
> > >
> > > b) Startup time is significantly longer for a broker that is restarted,
> > > due to communication with ZK to sync up on those topics.
> > >
> > > c) If topics are short-lived, even if Kafka expires the data segments
> > > using its standard 0.7 cleaner, the directory for the topic will still
> > > exist on disk and the topic is still considered "active" (in memory) in
> > > Kafka.  This causes problems - see (a) above (open file handles).
> > >
> > > d) Message latency is affected.  Kafka syncs messages to disk if x
> > > messages have buffered in memory, or y seconds have elapsed (both
> > > configurable).  If you have few topics and many messages (pattern #2),
> > > you will be hitting the x limit quite often, and get good throughput.
> > > However, if you have many topics and few messages per topic (pattern
> > > #1), you will have to rely on the y threshold to flush to disk, and
> > > setting this too low can impact performance (throughput) in a
> > > significant way.  Jay already mentioned this as random writes.
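
The flush rule in (d) can be illustrated with a small simulation. This is a simplified model of a single topic under stated assumptions, not the broker's actual code: `FlushPolicy`, `simulate`, and the parameters `max_messages` (the "x" limit) and `max_interval_ms` (the "y" threshold) are hypothetical names, and times are integer milliseconds fed in by the caller rather than read from a clock.

```python
class FlushPolicy:
    """Flush when max_messages are buffered or max_interval_ms has elapsed."""

    def __init__(self, max_messages, max_interval_ms):
        self.max_messages = max_messages        # the "x" limit
        self.max_interval_ms = max_interval_ms  # the "y" threshold
        self.buffered = 0
        self.last_flush_ms = 0

    def on_message(self, now_ms):
        """Buffer one message at now_ms; return "count", "time", or None."""
        self.buffered += 1
        if self.buffered >= self.max_messages:
            return self._flush(now_ms, "count")
        if now_ms - self.last_flush_ms >= self.max_interval_ms:
            return self._flush(now_ms, "time")
        return None

    def _flush(self, now_ms, reason):
        self.buffered = 0
        self.last_flush_ms = now_ms
        return reason


def simulate(message_interval_ms, duration_ms,
             max_messages=500, max_interval_ms=1000):
    """Count flushes by cause for one topic receiving messages at a fixed rate."""
    policy = FlushPolicy(max_messages, max_interval_ms)
    flushes = {"count": 0, "time": 0}
    for now_ms in range(message_interval_ms, duration_ms + 1, message_interval_ms):
        reason = policy.on_message(now_ms)
        if reason is not None:
            flushes[reason] += 1
    return flushes


# A busy topic (one message per ms) versus a quiet one (one message per 200 ms).
busy = simulate(message_interval_ms=1, duration_ms=10000)
quiet = simulate(message_interval_ms=200, duration_ms=10000)
```

In this model the busy topic only ever flushes on the message count, while the quiet topic only ever flushes on the timer and writes just a handful of messages each time. Multiply the quiet case by thousands of topics and the disk sees many small, scattered writes, which is the pattern #1 behavior described above.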
> > >
> > > We had to implement a number of solutions ourselves to resolve these
Jun Rao 2012-10-14, 03:37
Jason Rosenberg 2012-10-14, 06:42
Jun Rao 2012-10-15, 18:42
Jason Rosenberg 2012-10-11, 16:24
Taylor Gautier 2012-10-11, 17:08