Flume, mail # user - Analysis of Data


Re: Analysis of Data
Mike Percy 2013-02-08, 03:00
Hi Steven,
Thanks for chiming in! Please see my responses inline:

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:

> The only missing link within the Flume architecture I see in this
> conversation is the actual channels and brokers themselves which
> orchestrate this lovely undertaking of data collection.
>

Can you define what you mean by channels and brokers in this context? I ask
because "channel" is a synonym for a queueing event buffer in Flume parlance.
Also, can you elaborate on what you mean by orchestration? I think I know
where you're going but I don't want to put words in your mouth.

> One opportunity I do see (and I may be wrong) is for the data to be
> offloaded into a system such as Apache Mahout before being sent to the
> sink. Perhaps the concept of a ChannelAdapter of sorts? I.e. a Mahout
> Adapter? Just thinking out loud and it may be well out of the question.
>

Why not a Mahout sink? Mahout often wants sequence files in a particular
format to begin its MapReduce processing (e.g. its k-Means clustering
implementation), and Flume is already a good fit for that: its HDFS sink
and EventSerializers let you write a plugin that formats your data however
it needs to go in. In fact that works today if you have a batch (even
5-minute batch) use case. With today's functionality, you could use Oozie
to coordinate kicking off the Mahout M/R job periodically, as new data
becomes available and the files are rolled.
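
To make the batch idea concrete, here is a toy sketch in plain Python. It
is not Flume, Oozie, or Mahout code: roll_batches and the sample events are
hypothetical names made up for illustration. It assumes events arrive in
timestamp order and groups them into fixed 5-minute windows, handing each
window over once it closes, which is roughly the point at which a rolled
file would become available to a periodic job.

```python
def roll_batches(events, roll_interval_s=300):
    """Group (timestamp_s, payload) pairs into fixed time windows.

    Yields (window_start_s, payloads) each time a window closes.
    Assumes events arrive in non-decreasing timestamp order.
    """
    current_window = None
    batch = []
    for timestamp, payload in events:
        window = timestamp // roll_interval_s
        if current_window is not None and window != current_window:
            # The previous window is complete: hand it to the batch job.
            yield current_window * roll_interval_s, batch
            batch = []
        current_window = window
        batch.append(payload)
    if batch:
        # End of input: flush the last, still-open window.
        yield current_window * roll_interval_s, batch


# Hypothetical sample events: (seconds since start, payload).
events = [(0, "a"), (120, "b"), (299, "c"), (300, "d"), (610, "e")]
rolled = list(roll_batches(events))
```

In a real deployment the rolling would be done by the HDFS sink itself and
the "hand it to the batch job" step would be the scheduler's job; the sketch
only shows the shape of the data flow.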

Perhaps even more interestingly, I can see a use case where you might want
to use Mahout to do streaming / realtime updates driven by Flume in the
form of an interceptor or a Mahout sink. If online machine learning (e.g.
stochastic gradient descent or something else online) was what you were
thinking, I wonder if there are any folks on this list who might have an
interest in helping to work on putting such a thing together.
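
For anyone wondering what "online" means here, below is a minimal sketch of
stochastic gradient descent for logistic regression, updated one event at a
time. It is plain Python and uses neither Mahout's nor Flume's API; the
class OnlineLogisticRegression, the event_stream generator, and the
synthetic two-feature data are all assumptions made up for illustration. In
a real setup the update call would sit wherever the events pass through,
e.g. inside an interceptor or a sink.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        # Gradient of the log loss for a single example is (p - y) * x.
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def event_stream(n_events, seed=0):
    """Hypothetical stand-in for events arriving from a channel."""
    rng = random.Random(seed)
    for _ in range(n_events):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        label = 1 if x[0] + x[1] > 0 else 0
        yield x, label


model = OnlineLogisticRegression(n_features=2)
for features, label in event_stream(5000):
    model.update(features, label)  # the model is usable after every event
```

The point of the design is that no batch is ever materialized: each event
nudges the weights and is then forgotten, so the model stays current as long
as events keep flowing.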

In any case, I'd like to hear more about specific use cases for streaming
analytics. :)

Regards,
Mike