Kafka, mail # user - Re: Using Kafka for "data" messages - 2013-06-15, 03:28
Re: Using Kafka for "data" messages
Hi Mahendra, thanks for your reply.  I was planning on using the Atmosphere Framework (http://async-io.org/)  to handle the web push stuff (I've never used it before but we use PrimeFaces a little and that's what they use for their components).  I thought that I would have the JVM that the user is connected to just be a Kafka consumer.  Given the topic limitations, I think I am back to having a single topic that all guest data is placed on and have all JVMs publish and consume to that same topic.  So it would look something like this:

- I have 20 Web JVMs.
- Every minute 100 people log in per JVM, so 2,000 logins per minute. Each Web JVM publishes a single message per login.
- My data services consume the login event and then create about 1,000 messages per user containing data about that user. Each data message will probably be between 500 bytes and 2 KB. Let's assume an average of 1 KB per message, so that would be 1 MB per user, or about 2 GB per minute.
- The Recommendation service would consume all 2 GB of data per minute, end up using only a small amount of it, and then add its recommendation messages to the same topic.
- Each Web JVM would also consume the 2 GB of data plus the handful of recommendation messages per minute and end up ignoring everything but the recommendation messages (especially since the 2 GB represents the data for all the guests, but each JVM only has 1/20 of the guests logged in).
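To sanity-check the arithmetic in the steps above, here is a back-of-the-envelope calculation (Python purely for illustration; the services themselves would run on the JVM, and all figures are the estimates from this thread, treating 1 KB as 1,000 bytes as the round numbers imply):

```python
# Throughput estimate from the numbers in the steps above.
web_jvms = 20
logins_per_jvm_per_min = 100
msgs_per_user = 1_000      # data messages published per login
avg_msg_bytes = 1_000      # assume ~1 KB average (500 bytes - 2 KB range)

logins_per_min = web_jvms * logins_per_jvm_per_min   # 2,000 logins/minute
bytes_per_user = msgs_per_user * avg_msg_bytes       # ~1 MB per user
bytes_per_min = logins_per_min * bytes_per_user      # ~2 GB per minute

print(logins_per_min)            # 2000
print(bytes_per_user / 1e6)      # 1.0 (MB per user)
print(bytes_per_min / 1e9)       # 2.0 (GB per minute)
```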

It seems wasteful to put 2 GB of data per minute into Kafka only to have the Recommendation service consume all of it and end up using just a few KB, and also to have the web consume all of it when it just wants the few recommendation messages. However, the benefit of using a single topic is that in the future other services could consume more of the data or the recommendation messages, and since everything is on the same topic the order is guaranteed. In our immediate use case we could put the recommendation messages on their own topic, but in a sense we would be coupling our use case to our choice of topics. If we want the web to start also showing a little bit of the data from the data messages, we would be back to consuming the 2 GB of data in the Web JVMs.
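The filtering each Web JVM would have to do on the shared topic can be sketched roughly like this (plain Python standing in for a real Kafka consumer; the message shape and field names are hypothetical):

```python
# Simulated stream of messages on the single shared topic.
messages = [
    {"type": "data", "user": "u1", "payload": "profile..."},
    {"type": "data", "user": "u2", "payload": "history..."},
    {"type": "recommendation", "user": "u1", "payload": "rec-42"},
    {"type": "recommendation", "user": "u3", "payload": "rec-7"},
]

# Guests currently connected to *this* Web JVM (1/20 of all guests).
local_sessions = {"u1"}

# Each JVM reads every message but keeps only recommendations addressed
# to its own connected guests -- everything else is wasted I/O.
kept = [
    m for m in messages
    if m["type"] == "recommendation" and m["user"] in local_sessions
]
print(kept)  # only the u1 recommendation survives; 3 of 4 messages discarded
```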

Traditionally we would just have the Web call a service on the Recommendation system (possibly asynchronously) which in turn would call the database to load just the data it needs.  But we are thinking that by publishing all the data we have about the user (whether in the immediate future the existing systems need that data or not, future systems might) we are creating a system where we can easily add new consumers to do new things with all this data.  The main downside seems to be that most of the consumers are processing millions of messages that they have no interest in.  Do you think that the benefits outweigh the cons?  Is there a better way to achieve similar results?

 From: Mahendra M <[EMAIL PROTECTED]>
Sent: Friday, June 14, 2013 8:03 AM
Subject: Re: Using Kafka for "data" messages
Hi Josh,

Thanks for clarifying the use case. The idea is good, but I see the following three issues:
1. Creating a queue for each user. There could be limits on this.
2. Removing old queues.
3. If the same user logs in from multiple browsers, things get a bit more complex.

Can I suggest an alternate approach?

Using a combination of Kafka and XMPP-BOSH/Comet for this.
1. User logs in. Message is sent on a Kafka queue.
2. Web browser starts a long polling connection to a server (XMPP-BOSH / Comet)
3. Consumers pick up the message from (1) and do their job. They push their results to a results queue and to an XMPP end-point ([EMAIL PROTECTED])
4. The Recommender can pick up from the results queue and push its results to the XMPP end-point
5. Web front-end picks up the messages and does the displaying job.
If you plan it out more, you can avoid using Kafka in this use case and get by with just XMPP (for steps 1 and 3).
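The five steps above can be sketched as a simple simulation (in-memory queues and a dict standing in for the Kafka login queue, the results queue, and per-user XMPP end-points; all names here are hypothetical):

```python
from queue import Queue

# Stand-ins for the login queue, results queue, and XMPP end-points.
login_queue, results_queue = Queue(), Queue()
xmpp_endpoints = {}  # user -> list of messages pushed to that end-point

def push_xmpp(user, msg):
    xmpp_endpoints.setdefault(user, []).append(msg)

# (1) User logs in: a message is sent on the login queue.
login_queue.put({"user": "josh"})

# (2) The browser would open a long-polling BOSH/Comet connection here.

# (3) A data consumer picks up the login, does its job, and pushes its
#     result to both the results queue and the user's XMPP end-point.
event = login_queue.get()
result = {"user": event["user"], "data": "guest-data"}
results_queue.put(result)
push_xmpp(result["user"], result)

# (4) The recommender picks up from the results queue and pushes its
#     recommendation to the same XMPP end-point.
picked = results_queue.get()
push_xmpp(picked["user"], {"user": picked["user"], "rec": "rec-1"})

# (5) The web front-end displays whatever arrived at the end-point.
print(xmpp_endpoints["josh"])  # data result followed by the recommendation
```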

You also don't have to take care of a large number of queues, removing them, etc. XMPP is also really good at handling multiple end-points for a single user. (There are good XMPP servers like ejabberd and Tigase, and good lightweight JS libraries for handling connections.)

PS: I think my reply is going off-topic. So, I will stop.


On Thu, Jun 13, 2013 at 11:17 PM, Josh Foure <[EMAIL PROTECTED]> wrote:

Hi Mahendra, I think that is where it gets a little tricky.  I think it would work something like this:
