Kafka >> mail # user >> Using Kafka for "data" messages
Re: Using Kafka for "data" messages
You might want to take a look at mps (github.com/milindparikh/mps). Just as I
was thinking about blogging about it, your use case magically surfaced on the
mailing list. Since one of the supposed benefits of having a framework is to
enable quickly building things on top of it, I took a crack at your use case.
I attribute it in Part II (
Part I (
is here.

mps is not in Java, so you may not be able to use the framework as is. I will
try to explain the basic concept here, in the hope that it will be useful to
you and to others who might be interested in this sort of thing. First, mps
recognizes that Kafka squeezes everything it can out of the hard disk, and the
OS happily uses the RAM as required. The two other optimizable parameters are
network and compute capacity. The concepts of topic and partition are taken up
by Kafka, and therefore, at the application level, if the utilization of Kafka
is to be optimized, one shouldn't try to munge those concepts.
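To make the topic/partition point concrete: the broker, not the application, owns the mapping from key to partition. A minimal sketch of that keyed-partitioning idea follows; the real Kafka Java client hashes the key bytes with murmur2, and plain `hashCode()` stands in here only to keep the sketch short (the class and key names are illustrative, not from Kafka).

```java
// Sketch: how a message key deterministically selects a partition.
// Kafka's Java client uses murmur2 over the key bytes; hashCode()
// is an illustrative stand-in.
public class KeyedPartitioner {
    private final int numPartitions;

    public KeyedPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Same key -> same partition, so per-key ordering is preserved.
    public int partitionFor(String key) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        KeyedPartitioner p = new KeyedPartitioner(8);
        int a = p.partitionFor("guest-42");
        int b = p.partitionFor("guest-42");
        System.out.println(a == b); // prints "true": routing is deterministic
    }
}
```

Because the mapping is deterministic, all messages for one key land on one partition, which is exactly why the application shouldn't re-implement this layer.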

The other observation is that Kafka treats a message as a unit. The fact that a
message has a key (and a value) is mentioned almost in passing in the Kafka
documentation. Partitioning relies on keys, of course, but that is where it
stops. This is fortunate, because it potentially gives you a way to manage
messages at a higher level, exploiting network and compute capacity through
consistent hashing, while leaving Kafka to manage the disk and the OS to manage
the RAM. That way you are not saturating the entire available bandwidth and
expending enormous compute capacity to search for the proverbial needle in a
haystack. I say potentially because I am in Erlang land, where I know mps can
rest on the shoulders of "giants".
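The application-level consistent hashing mentioned above can be sketched in a few lines: route each message key to one of N consumer nodes on a hash ring, so that no consumer has to scan every message. This is a generic sketch, not the mps implementation; the node names, virtual-node count, and use of `hashCode()` are all assumptions for illustration.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of application-level consistent hashing: each key is routed
// to the first node clockwise from its hash position on the ring.
public class ConsistentRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Virtual nodes smooth out the key distribution across real nodes.
    public void addNode(String node, int virtualNodes) {
        for (int v = 0; v < virtualNodes; v++) {
            ring.put(hash(node + "#" + v), node);
        }
    }

    // Walk clockwise to the first node at or after the key's hash,
    // wrapping around to the start of the ring if necessary.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        ConsistentRing ring = new ConsistentRing();
        ring.addNode("consumer-1", 100);
        ring.addNode("consumer-2", 100);
        ring.addNode("consumer-3", 100);
        // A given guest's messages always go to the same consumer.
        System.out.println(ring.nodeFor("guest-42"));
    }
}
```

The point of the ring is that adding or removing a consumer only remaps the keys adjacent to it, rather than reshuffling everything, which keeps the routing stable while Kafka keeps managing the disk underneath.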

On Fri, Jun 14, 2013 at 8:27 PM, Josh Foure <[EMAIL PROTECTED]> wrote:

> Hi Mahendra, thanks for your reply.  I was planning on using the
> Atmosphere Framework (http://async-io.org/)  to handle the web push stuff
> (I've never used it before but we use PrimeFaces a little and that's what
> they use for their components).  I thought that I would have the JVM that
> the user is connected to just be a Kafka consumer.  Given the topic
> limitations, I think I am back to having a single topic that all guest data
> is placed on and have all JVMs publish and consume to that same topic.  So
> it would look something like this:
> - I have 20 Web JVMs.
> - Every minute 100 people log in per JVM, so 2,000 logins per minute.
> Each Web JVM publishes a single message per login.
> - My data services consume the login event and then create about 1,000
> messages per user containing data about that user. Each data message will
> probably be between 500 bytes and 2k. Let's assume an average of 1k per
> message, so that would be 1 MB per user, or about 2 GB per minute.
> - The Recommendation service would consume all 2 GB of data per minute,
> end up using only a small amount of it, and then add its recommendation
> messages to the same topic.
> - Each Web JVM would also consume the 2 GB of data plus the handful of
> recommendation messages per minute, and end up ignoring everything but the
> recommendation messages (especially since the 2 GB represents the data for
> all the guests, while each JVM only has 1/20 of the guests logged in).
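A quick check of the back-of-envelope numbers in the list above, using the stated assumptions (20 JVMs, 100 logins per JVM per minute, 1,000 messages per user, 1k average message size):

```java
// Sanity check of the throughput arithmetic in the email above.
public class ThroughputCheck {
    static long bytesPerMinute(int jvms, int loginsPerJvmPerMin,
                               int messagesPerUser, int bytesPerMessage) {
        long loginsPerMin = (long) jvms * loginsPerJvmPerMin;   // 2,000
        long bytesPerUser = (long) messagesPerUser * bytesPerMessage; // ~1 MB
        return loginsPerMin * bytesPerUser;
    }

    public static void main(String[] args) {
        long perMin = bytesPerMinute(20, 100, 1_000, 1_024);
        System.out.printf("%.2f GB per minute%n", perMin / 1e9); // ~2.05 GB
    }
}
```

So the "about 2 GB per minute" estimate holds under those assumptions.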
> It seems wasteful to put 2 GB of data per minute into Kafka only to have
> the Recommendation service consume all of it and end up using a few KB,
> and also to have the web tier consume all of it when it just wants the few
> recommendation messages. However, the benefit of using a single topic is
> that in the future other services could consume more of the data or the
> recommendation messages, and since everything is on the same topic the
> order is guaranteed. In our immediate use case we could put the
> recommendation messages on their own topic, but in a sense we would be
> coupling our use case