One other option is to use something like Druid, especially if you care
about doing arbitrary dimensional drilldowns.
It reads from Kafka and can do simple rollups for you automatically
(meaning you don't need Storm if all you are doing with Storm is a simple
"group by" style rollup). If you need to do real-time joins of various
streams, it's fairly easy to use Storm to do that and push the data into
Druid as well.
Druid handles the delayed messages issue by allowing for a configurable
time window in which messages can be delayed (we run it with a 10 minute
window). Using Druid would be similar to Travis's setup, except it would
allow you to ingest the data in real-time and query the data as it is being
ingested, instead of having to wait for the persist to S3 and the load into
Redshift.
Also, I'm biased about Druid ;).
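To make the rollup point concrete: the "group by"-style rollup in question is just keyed counting per time bucket, which Druid does for you at ingest. A minimal sketch in plain Python (the event tuples and dimension values are made up for illustration):

```python
from collections import defaultdict

def rollup(events, bucket_secs=60):
    """Group-by style rollup: count events per (time bucket, dimension)."""
    counts = defaultdict(int)
    for ts, dim in events:
        bucket = ts - (ts % bucket_secs)  # floor timestamp to its bucket
        counts[(bucket, dim)] += 1
    return dict(counts)

events = [(1377702000, "us"), (1377702030, "us"), (1377702075, "eu")]
print(rollup(events))  # the first two events share one minute bucket
```

If this is all your Storm topology does, Druid's ingest-time rollup replaces it.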
On Fri, Aug 30, 2013 at 5:47 PM, Dan Di Spaltro <[EMAIL PROTECTED]> wrote:
> You could also use something more oriented at timeseries data like
> https://github.com/rackerlabs/blueflood/. Then you'd have to write some
> output adapters to feed the additional processing of your data elsewhere.
> I think the team is working on making an output adapter for Kafka for the
> rolled-up metrics (5m, 20m, 60m, etc). It has the capability to re-emit
> data when some arrives late.
> On Wed, Aug 28, 2013 at 7:55 AM, Travis Brady <[EMAIL PROTECTED]> wrote:
> > This is a very common problem in my experience. Late-arriving and
> > semi-ordered data make a lot of stream processing problems more difficult.
> > Are you able to perform analysis with part of the data? For instance, by
> > buffering some number of events and then analyzing them?
> > How exactly do you know definitively that you've received *everything* for
> > some time window?
> > Here's what I do (using Storm+Kafka+Redshift):
> > A Storm topo reads tuples from a Kafka topic and aggregates them using a
> > Guava Cache (which provides automatic time- and size-based eviction).
> > At its simplest, the key in the cache is the minute as a unix timestamp and
> > the value is a count of events for that time window. In more complex cases
> > the key is a composite data type and the value might be a StreamSummary or
> > HyperLogLog class from stream-lib. Anyway, I configure the Cache to evict
> > entries once they've gone untouched for 30 seconds.
> > On eviction the data flows to S3 and from there to Redshift and that is
> > where we're able to get our canonical answer, because even if a certain
> > minute is composed of many records (due to data arriving late) in Redshift,
> > we just aggregate over those records.
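> > The eviction flow above, sketched in plain Python (the real setup uses
> > Guava's Cache with expireAfterAccess; the flush callback here is
> > hypothetical and stands in for the write to S3):

```python
import time

class EvictingCounts:
    """Per-minute event counts, evicted once untouched for `idle_secs`
    (mimicking Guava's expireAfterAccess). Evicted entries are handed to
    `flush`, which stands in for the S3 write."""

    def __init__(self, idle_secs, flush, clock=time.time):
        self.idle_secs = idle_secs
        self.flush = flush
        self.clock = clock          # injectable for deterministic testing
        self.entries = {}           # minute -> [count, last_touched]

    def add(self, event_ts):
        minute = event_ts - (event_ts % 60)
        entry = self.entries.setdefault(minute, [0, 0.0])
        entry[0] += 1
        entry[1] = self.clock()     # touching an entry resets its idle timer

    def evict_idle(self):
        """Evict (and flush) every entry idle for at least idle_secs."""
        now = self.clock()
        for minute, (count, touched) in list(self.entries.items()):
            if now - touched >= self.idle_secs:
                del self.entries[minute]
                self.flush(minute, count)
```

> > Note that a minute evicted early can be re-opened when late data arrives,
> > so the same minute can produce multiple flushed rows downstream.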
> > You may want to look at algebird; it's mostly undocumented but provides a
> > lot of nice primitives for doing streaming aggregation.
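> > The key primitive such libraries standardize is an associative merge of
> > partial aggregates. A toy Python illustration with plain counts (this is
> > not algebird's actual API; HyperLogLog and StreamSummary sketches merge
> > the same way in principle):

```python
def merge_counts(a, b):
    """Associative, commutative merge of two partial count aggregates;
    the monoid operation streaming-aggregation libraries build around."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

shard1 = {"2:00": 3, "2:01": 1}
shard2 = {"2:00": 2}            # e.g. late-arriving data for 2:00
print(merge_counts(shard1, shard2))
```

> > Because the merge is associative, partial results from different buffers,
> > workers, or late batches combine in any order without loss.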
> > Good luck.
> > On Wed, Aug 28, 2013 at 8:13 AM, Philip O'Toole <[EMAIL PROTECTED]> wrote:
> > > Well, you can only store data in Kafka, you can't put application logic
> > in
> > > there.
> > >
> > > Storm is good for processing data, but it is not a data store, so that
> > > rules it out. Redis might work, but it is only an in-memory store (it
> > > seems to have persistence, but I don't know much about that).
> > >
> > > You could try using Kafka and Storm to write the data to something like
> > > Cassandra or Elasticsearch, and perform your analysis later on the data
> > set
> > > as it lives in there.
> > >
> > > Philip
> > >
> > > On Aug 28, 2013, at 5:10 AM, Yavar Husain <[EMAIL PROTECTED]> wrote:
> > >
> > > > I have an application where I will be getting some time-series data
> > > > which I am feeding to Kafka; Kafka in turn passes the data to Storm for
> > > > some real-time processing.
> > > >
> > > > Now one of my use cases is that there might be a certain lag in my data.
> > > > For example: I might not get all the data for 2:00:00 PM all together.
> > > > There is a possibility that, say, all the data for 2:00:00 PM does not
> > > > arrive