Kafka >> mail # user >> Arguments for Kafka over RabbitMQ ?


Re: Arguments for Kafka over RabbitMQ ?
I am not making any assumptions other than Rabbit needs to maintain the
state of the consumers.  As the Kafka docs point out this is the
fundamental difference between most providers in the space and Kafka.

Thinking of a high throughput stream of messages and many active consumers
of different speeds, I am struggling with how Rabbit can avoid random I/O
with all the acks.  Each consumer’s state is certainly not linearly stored
on disk so there would have to be seeks.  Further, log-structured merge
trees are used in NoSQL stores like Cassandra and are optimized for random
read access.  Why do you feel ‘Rabbit does not do lots of random IO?’
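
A minimal sketch of the distinction being drawn here, assuming a toy
in-memory log: the broker only appends, and each consumer's state is a single
integer offset rather than a per-message ack.  The class and method names
(AppendOnlyLog, poll, lag) are hypothetical and are not Kafka's or Rabbit's
API.

```python
class AppendOnlyLog:
    """Toy stand-in for a partition log; not Kafka's actual implementation."""

    def __init__(self):
        self._messages = []   # appended in arrival order, never rewritten
        self._offsets = {}    # consumer name -> next offset to read

    def append(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        """Number of messages this consumer has not read yet."""
        return len(self._messages) - self._offsets.get(consumer, 0)


log = AppendOnlyLog()
for i in range(100):
    log.append(f"event-{i}")

# Two consumers of different speeds read the same log independently.
fast = log.poll("storm", max_messages=100)
slow = log.poll("hadoop", max_messages=10)
print(log.lag("storm"), log.lag("hadoop"))
```

In this sketch a slow consumer costs the log nothing extra: its lag grows,
but no per-message bookkeeping is written on its behalf.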

Some docs on the Rabbit site seem to mention that performance degrades as
the size of the persistent message store increases.  Too much random I/O
could certainly explain this degradation.

http://www.rabbitmq.com/blog/2011/09/24/sizing-your-rabbits/

The use case I have been talking about all along is a continuous firehose
of data with throughput in the hundreds of thousands of messages per second.
You will have 10-20 consumers of different speeds ranging from real-time
(Storm) to batch (Hadoop).  This means the message store is in the hundreds
of GBs to terabytes range at all times.
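
As a rough back-of-envelope check on that range, the sketch below multiplies
rate, message size, and the time the slowest consumer lags behind.  The
200,000 messages per second, the 1 KB average message size, and the lag
windows are all assumed values for illustration; none of them come from a
measurement.

```python
GB = 10**9


def backlog_bytes(messages_per_second, avg_message_bytes, retention_seconds):
    """Bytes the broker must hold if the slowest consumer lags this long."""
    return messages_per_second * avg_message_bytes * retention_seconds


rate = 200_000       # assumed: "hundreds of thousands" of messages per second
message_size = 1_000  # assumed: 1 KB average message

for label, seconds in [("1 hour", 3_600), ("1 day", 86_400)]:
    size_gb = backlog_bytes(rate, message_size, seconds) / GB
    print(f"{label}: {size_gb:,.0f} GB")
```

Under these assumptions an hourly batch consumer already implies hundreds of
GBs on disk, and a daily one implies several terabytes.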

-Jonathan

On Sat, Jun 8, 2013 at 2:09 PM, Alexis Richardson <[EMAIL PROTECTED]> wrote:

> Jonathan
>
> I am aware of the difference between sequential writes and other kinds
> of writes ;p)
>
> AFAIK the Kafka docs describe a sort of platonic alternative system,
> eg "normally people do this.. Kafka does that..".  This is a good way
> to explain design decisions.  However, I think you may be assuming
> that Rabbit is a lot like the generalised other system.  But it is not
> - eg Rabbit does not do lots of random IO.  I'm led to understand that
> Rabbit's msg store is closer to log structured storage (a la
> Log-Structured Merge Trees) in some ways.  However, Rabbit does do
> more synchronous I/O, and has a different caching strategy (AFAIK).
> "It's complicated"
>
> In order to help provide useful info to the community, please could
> you describe a concrete test that we could discuss?  I think that
> would really help.  You mentioned a scenario with one large data set
> being streamed into the broker(s), and then consumed (in full?) by 2+
> consumers of wildly varying speeds.  Could you elaborate please?
>
> alexis
>
>
> Also, this is probably OT but I have never grokked this in the Design Doc:
>
> "Consumer rebalancing is triggered on each addition or removal of both
> broker nodes and other consumers within the same group. For a given
> topic and a given consumer group, broker partitions are divided evenly
> among consumers within the group."
>
> When a new consumer and/or partition appears, can messages in the
> broker get "moved" from one partition to another?
>
>
> On Sat, Jun 8, 2013 at 12:53 PM, Jonathan Hodges <[EMAIL PROTECTED]> wrote:
> > On Sat, Jun 8, 2013 at 2:09 AM, Jonathan Hodges <[EMAIL PROTECTED]> wrote:
> >> Thanks so much for your replies.  This has been a great help
> >> understanding Rabbit better with having very little experience with it.
> >> I have a few follow up comments below.
> >
> > Happy to help!
> >
> > I'm afraid I don't follow your arguments below.  Rabbit contains many
> > optimisations too.  I'm told that it is possible to saturate the disk
> > i/o, and you saw the message rates I quoted in the previous email.
> > YES of course there are differences, mostly an accumulation of things.
> >  For example Rabbit spends more time doing work before it writes to
> > disk.
> >
> > It would be great if you could detail some of the optimizations.  It
> > would seem to me Rabbit has much more overhead due to maintaining state of
> > the consumers as well as general messaging processing which makes it
> > impossible to manage the same write throughput as Kafka when you need to
> > persist large amounts of data to disk.  I definitely believe you that
> > Rabbit can saturate the disk but it is much more seek centric i.e. random