Kafka, mail # user - Using Kafka for "data" messages - 2013-06-13, 15:22
Solr & Elasticsearch trainings in New York & San Francisco [more info][hide]
 Search Hadoop and all its subprojects:

Switch to Threaded View
Copy link to this message
Using Kafka for "data" messages
Hi all, my team is proposing a novel
way of using Kafka and I am hoping someone can help do a sanity check on this:
1.  When a user logs
into our website, we will create a “logged in” event message in Kafka
containing the user id. 
2.  30+ systems
(consumers each in their own consumer groups) will consume this event and
lookup data about this user id.  They
will then publish all of this data back out into Kafka as a series of data messages.  One message may include the user’s name,
another the user’s address, another the user’s last 10 searches, another their
last 10 orders, etc.  The plan is that a
single “logged in” event may trigger hundreds if not thousands of additional data
3.  Another system,
the “Product Recommendation” system, will have consumed the original “logged in”
message and will also consume a subset of the data messages (realistically I
think it would need to consume all of the data messages but would discard the
ones it doesn’t need).  As the Product
Recommendation consumes the data messages, it will process recommended products
and publish out recommendation messages (that get more and more specific as it
has consumed more and more data messages).
4.  The original
website will consume the recommendation messages and show the recommendations to
the user as it gets them.
You don’t see many systems implemented this way but since
Kafka has such a higher throughput than your typical MOM, this approach seems
The benefits are:
1.  If we start
collecting more information about the users, we can simply start publishing
that in new data messages and consumers can start processing those messages
whenever they want.  If we were doing
this in a more traditional SOA approach the schemas would need to change every time
we added a field but with this approach we can just create new messages without
touching existing ones.
2.  We are looking to
make our systems smaller so if we end up with more, smaller systems that each
publish a small number of events, it becomes easier to make changes and test
the changes.  If we were doing this in a
more traditional SOA approach we would need to retest each consumer every time
we changed our bigger SOA services.
The downside appears to be:
1.  We may be
publishing a large amount of data that never gets used but that everyone needs
to consume to see if they need it before discarding it.
2.  The Product Recommendation
system may need to wait until it consumes a number of messages and keep track
of all the data internally before it can start processing.
3.  While we may be
able to keep the messages somewhat small, the fact that they contain data will
mean they will be bigger than your tradition EDA messages.
4.  It seems like we
can do a lot of this using SOA (we already have an ESB than can do
transformations to address consumers expecting an older version of the data).
Any insight is appreciated.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB