Thanks so much for your replies. They have been a great help in
understanding Rabbit better, as I have very little experience with it. I
have a few follow-up comments below.
I am not sure why you think this is a problem?
For a fixed number of producers and consumers, the pubsub and delivery
semantics of Rabbit and Kafka are quite similar. Think of Rabbit as
adding an in-memory cache that is used to (a) speed up read
consumption, (b) obviate disk writes when possible due to all client
consumers being available and consuming.
Actually I think this is the main use case that sets Kafka apart from
Rabbit and speaks to the poster’s ‘Arguments for Kafka over RabbitMQ’
question. As you mentioned Rabbit is a general purpose messaging system
and along with that has a lot of features not found in Kafka. There are
plenty of times when Rabbit makes more sense than Kafka, but not when you
are maintaining large message stores and require high throughput to disk.
Persisting hundreds of GBs of messages to disk is a much different problem
than managing messages in memory. Since Rabbit must maintain the state of
its consumers, I imagine it is subject to random data access patterns on disk
as opposed to sequential ones. Quoting the Kafka design page (http://kafka.apache.org/07/design.html):
the performance of sequential writes on
a 6 7200rpm SATA RAID-5 array is about 300MB/sec, but the performance of
random writes is only about 50k/sec -- a difference of nearly 10000X.
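To make that contrast concrete, here is a minimal sketch of the two access
patterns being compared. The class and method names are illustrative, not
from Kafka or Rabbit, and this only shows the shape of each pattern; a real
benchmark would need large data, fsync, and warm-up runs.

```java
import java.io.*;

// Sequential appends always write at the tail of the file; random writes
// must seek to a new offset before each write, which is what costs on disk.
class WritePatterns {
    static void sequentialAppend(File f, byte[] record, int count) throws IOException {
        try (FileOutputStream out = new FileOutputStream(f, true)) {
            for (int i = 0; i < count; i++) out.write(record); // always at the tail
        }
    }

    static void randomWrites(File f, byte[] record, long[] offsets) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            for (long off : offsets) {
                raf.seek(off); // a seek before every write
                raf.write(record);
            }
        }
    }
}
```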
They go on to say that the persistent data structure used for messaging
system metadata is often a BTree. BTrees are the most versatile data structure
available, and make it possible to support a wide variety of transactional
and non-transactional semantics in the messaging system. They do come with
a fairly high cost, though: Btree operations are O(log N). Normally O(log
N) is considered essentially equivalent to constant time, but this is not
true for disk operations. Disk seeks come at 10 ms a pop, and each disk can
do only one seek at a time so parallelism is limited. Hence even a handful
of disk seeks leads to very high overhead. Since storage systems mix very
fast cached operations with actual physical disk operations, the observed
performance of tree structures is often superlinear. Furthermore BTrees
require a very sophisticated page or row locking implementation to avoid
locking the entire tree on each operation. The implementation must pay a
fairly high price for row-locking or else effectively serialize all reads.
Because of the heavy reliance on disk seeks it is not possible to
effectively take advantage of the improvements in drive density, and one is
forced to use small (< 100GB) high RPM SAS drives to maintain a sane ratio
of data to seek capacity.
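A quick back-of-the-envelope calculation shows why a handful of seeks is so
costly, using the quoted 10 ms seek time. The entry count and fanout below
are hypothetical values chosen for illustration:

```java
// Rough cost of uncached BTree lookups at 10 ms per seek.
class SeekCost {
    // A BTree over `entries` keys with the given fanout has roughly
    // log_fanout(entries) levels, i.e. that many seeks when nothing is cached.
    static int depth(long entries, int fanout) {
        return (int) Math.ceil(Math.log(entries) / Math.log(fanout));
    }

    static double opsPerSecond(long entries, int fanout, double seekMs) {
        return 1000.0 / (depth(entries, fanout) * seekMs);
    }
}
```

With a billion entries and a fanout of 100, that is 5 levels, so roughly
50 ms per uncached lookup and only about 20 lookups per second per disk,
which is why even a handful of disk seeks leads to very high overhead.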
Intuitively a persistent queue could be built on simple reads and appends
to files, as is commonly the case with logging solutions. This structure
would not support the rich semantics of a BTree implementation, but it has
the advantage that all operations are O(1) and reads do not
block writes or each other. This has obvious performance advantages since
the performance is completely decoupled from the data size--one server can
now take full advantage of a number of cheap, low-rotational speed 1+TB
SATA drives. Though they have poor seek performance, these drives often
have comparable performance for large reads and writes at 1/3 the price and
3x the capacity.
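A persistent queue along these lines can be sketched in a few lines. The
class and method names below are illustrative, not Kafka's actual
implementation: records are length-prefixed and appended at the tail, and a
read is a single positioned lookup by offset.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Minimal append-only log: O(1) appends at the tail, O(1) reads by offset.
class AppendLog {
    private final RandomAccessFile file;

    AppendLog(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    // Append a record (4-byte length prefix + payload); returns its offset,
    // which the consumer keeps as its position in the log.
    synchronized long append(byte[] payload) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.writeInt(payload.length);
        file.write(payload);
        return offset;
    }

    // Read the record that starts at the given offset.
    synchronized byte[] read(long offset) throws IOException {
        file.seek(offset);
        byte[] payload = new byte[file.readInt()];
        file.readFully(payload);
        return payload;
    }
}
```

Because the broker never tracks which consumer has seen what, consumers can
re-read old data simply by supplying an old offset, which is what makes the
long retention window described below cheap to offer.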
Having access to virtually unlimited disk space without penalty means that
we can provide some features not usually found in a messaging system. For
example, in Kafka, instead of deleting a message immediately after
consumption, we can retain messages for a relatively long period (say, a week).
Our assumption is that the volume of messages is extremely high, indeed it
is some multiple of the total number of page views for the site (since a
page view is one of the activities we process). Furthermore we assume each
message published is read at least once (and often multiple times), hence
we optimize for consumption rather than production.
There are two common causes of inefficiency: too many network requests, and
excessive byte copying.
To encourage efficiency, the APIs are built around a "message set"
abstraction that naturally groups messages. This allows network requests to
group messages together and amortize the overhead of the network roundtrip
rather than sending a single message at a time.
The MessageSet implementation is itself a very thin API that wraps a byte
array or file. Hence there is no separate serialization or deserialization
step required for message processing: message fields are lazily
deserialized as needed (or not deserialized at all if not needed).
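The idea can be sketched as a batch of length-prefixed messages packed
back-to-back in one buffer. This is illustrative only, not Kafka's actual
MessageSet class or wire format:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// A "message set" as a thin wrapper over a single byte buffer. Building or
// shipping the set never decodes the payloads; they are only copied out
// when a consumer actually walks the set.
class SimpleMessageSet {
    private final ByteBuffer buffer;

    SimpleMessageSet(ByteBuffer buffer) { this.buffer = buffer; }

    // Pack a batch of messages into one buffer, amortizing the per-request
    // network overhead across the whole batch.
    static SimpleMessageSet of(List<byte[]> messages) {
        int size = 0;
        for (byte[] m : messages) size += 4 + m.length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] m : messages) {
            buf.putInt(m.length);
            buf.put(m);
        }
        buf.flip();
        return new SimpleMessageSet(buf);
    }

    // Decode the length prefixes and copy each payload out; until this is
    // called, the set is just opaque bytes.
    List<byte[]> toList() {
        List<byte[]> out = new ArrayList<>();
        ByteBuffer view = buffer.duplicate();
        while (view.hasRemaining()) {
            byte[] payload = new byte[view.getInt()];
            view.get(payload);
            out.add(payload);
        }
        return out;
    }
}
```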
The message log maintained by the broker is itself just a directory of
message sets that have been written to disk. This abstraction allows a
single byte format to be shared by both the broker and the consumer (and to
some degree the producer, though producer messages are checksummed and
validated before being added to the log).
Maintaining this common format allows optimization of the most important
operation: network transfer of persistent log chunks. Modern unix operating
systems offer a highly optimized code path for transferring data out of
pagecache to a socket; in Linux this is done with the sendfile system call.
Java provides access to this system call through the FileChannel.transferTo API.
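A sketch of how a broker might use this call follows; the class and method
names here are illustrative, not Kafka's. The target is typed as a
WritableByteChannel, so in production it would be a SocketChannel, which is
the case where Linux can use sendfile under the hood:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Zero-copy transfer: FileChannel.transferTo moves the file's pagecache
// contents to the target channel without passing through a user-space buffer.
class ZeroCopy {
    static long transferFully(File source, WritableByteChannel target) throws IOException {
        try (FileChannel in = new FileInputStream(source).getChannel()) {
            long position = 0;
            long size = in.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                position += in.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```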
To understand the impact of sendfile, it is important to understand the
common data path for transfer of data from file to socket:
1. The operating system reads data from the disk into pagecache in kernel
space
2. The application reads the data from kernel space into a user-space
buffer
3. The application writes the data back into kernel space into a socket
buffer
4. The operating system copies the data from the socket buffer to the NIC
buffer, where it is sent over the network
This is clearly inefficient: there are four copies and two system calls. Using