1) We are building near-realtime monitoring application so we need to be
able to drain from *HEARTBEAT *topic as soon as possible. We have 6000
servers dumping heartbeats messages at rate of ~1MB per minute from each
servers ? How can I partition a message such way that consumer can drain
as soon as message arrives ? Basically, we have to meet 30 seconds SLA
from time message was created to consumer (we just insert message into
cassandra ) for our monitoring tools ) I was thinking about partitioning
by HOSTNAME as partitioning key so we can have 6000 consumers draining
from each partition ? Can a new partition be created dynamically when new
server is added and a consumer be started automatically ? Please recommend
6000 servers sending 1MB/min isn't a lot of data. The number of partitions
for the topic would depend largely on the throughput of your consumer,
since that involves writing to Cassandra. Since the # of partitions can be
increased on the fly without downtime, I recommend you start small, measure
your consumer's throughput and then based on that increase the # of
partitions to match the # of consumers you would have to deploy, to easily
keep up with the load. The 30 second SLA can be tricky since you are
batching for 1 min on the producers, so you will see a delay of 1min on
every batch. If you set the right configs for batching on the producer
side, you could bring down the end-to-end latency down to few 100s of ms.
2) With near-realtime requirement, if the consumer is lag behind (lets
say by 5 minutes or more due to downtime or code roll or etc ), when
consumer group restarts , I would like to consume first latest messages (to
meet the SLA) before consuming from last offset ? How can I achieve this ?
Also, I need to be able to backfill messages.
It seems like the new consumer group starting from the latest message will
work, provided you have a way to stop the old consumer group once it has
caught up. You might have to place this logic in your application code.
However, this will be much easier to do once we have our new consumer
the new APIs provide an easy way to "seek" to certain offsets and
provide enough information for you know to when the consumer has caught up
to a certain offset.
I also highly recommend using the soon-to-be-released 0.8.1.1 instead of
Hope that helps.
On Thu, Apr 10, 2014 at 8:47 AM, Bhavesh Mistry