Flume >> mail # user >> Analysis of Data


Re: Analysis of Data
Hi Steven,
Thanks for chiming in! Please see my responses inline:

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:

> The only missing link within the Flume architecture I see in this
> conversation is the actual channels and brokers themselves, which
> orchestrate this lovely undertaking of data collection.
>

Can you define what you mean by channels and brokers in this context? I ask
because "channel" is a synonym for a queueing event buffer in Flume
parlance. Also, can you elaborate on what you mean by orchestration? I
think I know where you're going, but I don't want to put words in your mouth.

> One opportunity I do see (and I may be wrong) is for the data to be
> offloaded into a system such as Apache Mahout before being sent to the
> sink. Perhaps the concept of a ChannelAdapter of sorts? I.e. a Mahout
> Adapter? Just thinking out loud, and it may be well out of the question.
>

Why not a Mahout sink? Mahout often wants sequence files in a particular
format to begin its MapReduce processing (e.g. its k-Means clustering
implementation). Flume is already a good fit for that: its HDFS sink
supports EventSerializers, so you can write a plugin that formats your
data however Mahout needs it. In fact, that works today if you have a batch
(even 5-minute batch) use case. With today's functionality, you could use
Oozie to kick off the Mahout M/R job periodically, as new data becomes
available and the files are rolled.
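
To make the batch setup above a bit more concrete, here is a rough sketch
of an HDFS sink that rolls a new sequence file every five minutes. The
agent, channel, and sink names (agent1, ch1, mahoutSink) and the HDFS path
are made up for illustration, and whether plain Writable sequence files are
what your Mahout job expects is an assumption; a custom format would still
need its own plugin.

```properties
# Hypothetical names: agent1, ch1, mahoutSink. Adjust to your own agent.
agent1.channels = ch1
agent1.sinks = mahoutSink

# In-memory channel, used here only to keep the sketch short.
agent1.channels.ch1.type = memory

agent1.sinks.mahoutSink.type = hdfs
agent1.sinks.mahoutSink.channel = ch1

# Placeholder path; point this at the directory the Mahout job reads.
agent1.sinks.mahoutSink.hdfs.path = hdfs://namenode/flume/mahout-input

# Write Hadoop sequence files with Writable records.
agent1.sinks.mahoutSink.hdfs.fileType = SequenceFile
agent1.sinks.mahoutSink.hdfs.writeFormat = Writable

# Roll on time only: every 300 seconds, never on size or event count.
agent1.sinks.mahoutSink.hdfs.rollInterval = 300
agent1.sinks.mahoutSink.hdfs.rollSize = 0
agent1.sinks.mahoutSink.hdfs.rollCount = 0
```

An Oozie coordinator could then be scheduled at the same five-minute
frequency to pick up the rolled files.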

Perhaps even more interestingly, I can see a use case where you might want
to use Mahout to do streaming / realtime updates driven by Flume, in the
form of an interceptor or a Mahout sink. If online machine learning (e.g.
stochastic gradient descent or some other online method) is what you had
in mind, I wonder if there are any folks on this list who might be
interested in helping to put such a thing together.
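
To illustrate what "online" means here, below is a minimal sketch of
logistic regression trained by stochastic gradient descent, one example at
a time. It is plain Java and uses neither the Flume nor the Mahout API; the
class OnlineSgdSketch and its methods are hypothetical names, and a real
interceptor or sink would have to turn each event body into a feature
vector before calling update().

```java
/** Hypothetical sketch: logistic regression updated one example at a time. */
class OnlineSgdSketch {
    private final double[] weights;
    private double bias;
    private final double learningRate;

    OnlineSgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures]; // all weights start at zero
        this.bias = 0.0;
        this.learningRate = learningRate;
    }

    /** Returns the predicted probability that the label is 1. */
    double predict(double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** One SGD step on the log loss for a single (features, label) pair. */
    void update(double[] features, int label) {
        double error = predict(features) - label;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * features[i];
        }
        bias -= learningRate * error;
    }

    public static void main(String[] args) {
        OnlineSgdSketch model = new OnlineSgdSketch(2, 0.1);
        // Toy stream: the first feature signals label 1, the second label 0.
        double[][] stream = {{1.0, 0.0}, {0.0, 1.0}};
        int[] labels = {1, 0};
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < stream.length; i++) {
                model.update(stream[i], labels[i]);
            }
        }
        System.out.println(model.predict(new double[] {1.0, 0.0}));
        System.out.println(model.predict(new double[] {0.0, 1.0}));
    }
}
```

The point of the sketch is that each update only needs the current event
and the current weights, which is why this style of learning could sit
inside an event-at-a-time component rather than a periodic batch job.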

In any case, I'd like to hear more about specific use cases for streaming
analytics. :)

Regards,
Mike