Flume, mail # user - Analysis of Data


Messages in this thread:
Surindhar  2013-02-07, 09:52
Nitin Pawar  2013-02-07, 10:15
Surindhar  2013-02-07, 10:24
Bertrand Dechoux  2013-02-07, 10:30
Inder Pall  2013-02-07, 10:39
Mike Percy  2013-02-07, 10:59
Nitin Pawar  2013-02-07, 11:22
Steven Yates  2013-02-07, 23:04
Mike Percy  2013-02-08, 03:00
Mike Percy  2013-02-08, 02:46
Steve Yates  2013-02-08, 03:22
Nitin Pawar  2013-02-08, 04:55
Inder Pall  2013-02-08, 08:48
Mike Percy  2013-02-08, 08:56
Nitin Pawar  2013-02-08, 09:45
syates@...  2013-02-08, 11:34
Mike Percy  2013-02-08, 22:09
Steven Yates  2013-02-10, 09:00
Re: Analysis of Data
Steven Yates 2013-02-08, 10:45
Nitin, +1 on the Storm sink. Worth discussing further IMO.

-Steve

From:  Nitin Pawar <[EMAIL PROTECTED]>
Date:  Fri, 8 Feb 2013 10:25:51 +0530
To:  <[EMAIL PROTECTED]>, Steven Yates <[EMAIL PROTECTED]>
Subject:  Re: Analysis of Data

Hi Steve,

I understand the idea of having data processed inside Flume by streaming it
to another Flume agent, but do we really need to re-engineer something inside
Flume? The core Flume dev team may have better ideas on this, but for
streaming data processing Storm is currently a strong candidate.
Flume does have an open JIRA on this integration, FLUME-1286
<https://issues.apache.org/jira/browse/FLUME-1286>.

It would be interesting to draw up performance comparisons if data processing
logic were added to Flume. We do currently see people doing a little
pre-processing of their data (they have their own custom channel types where
they modify the data before sinking it).
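
(For illustration: the usual hook for this kind of light in-flight processing
in Flume 1.x is an Interceptor attached to a source. Below is a minimal
sketch; the package, class name, and header key are made up for this example.)

package org.example.flume;

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Minimal interceptor: tags each event with a header before it reaches the
// channel. Returning null from intercept(Event) would drop the event instead.
public class TaggingInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // no setup needed for this sketch
  }

  @Override
  public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    headers.put("processed-by", "tagging-interceptor");
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
    // nothing to release
  }

  // Builder class named in the agent configuration.
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new TaggingInterceptor();
    }

    @Override
    public void configure(Context context) {
      // read interceptor properties here if needed
    }
  }
}

(Assuming the class is on the agent's classpath, it would be referenced from
the source configuration via something like
agent.sources.src.interceptors.i1.type = org.example.flume.TaggingInterceptor$Builder,
where the agent, source, and interceptor names are placeholders.)
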
On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[EMAIL PROTECTED]> wrote:
> Thanks for your feedback, Mike. I have been thinking about this a little
> more. Using Mahout as an example, I was considering the concept of developing
> an enriched 'sink', so to speak, which would accept input streams / messages
> from a Flume channel and forward them to a 'service', i.e. a Mahout service,
> which would subsequently deliver the results to the configured sink. So yes,
> it would behave as an intercept->filter->process->sink for applicable data
> items.
>
> I apologise if that is still vague. It would be great to receive further
> feedback from the user group.
>
> -Steve
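
(For illustration of the "enriched sink" idea above: a custom Flume sink pulls
events from its channel inside a transaction and can hand them to an external
service before, or instead of, writing to storage. The sketch below is only an
assumption of what that could look like; AnalysisServiceClient is a
hypothetical stand-in, not an existing API.)

package org.example.flume;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Rough sketch of a sink that forwards event payloads to an external
// analysis service. The service client below is a hypothetical placeholder.
public class AnalysisServiceSink extends AbstractSink implements Configurable {

  private String serviceUrl;
  private AnalysisServiceClient client;

  @Override
  public void configure(Context context) {
    serviceUrl = context.getString("serviceUrl", "http://localhost:8080/analyze");
  }

  @Override
  public synchronized void start() {
    client = new AnalysisServiceClient(serviceUrl);
    super.start();
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        txn.commit();
        return Status.BACKOFF;   // channel is empty right now
      }
      client.submit(event.getBody());
      txn.commit();
      return Status.READY;
    } catch (Exception e) {
      txn.rollback();
      throw new EventDeliveryException("Failed to forward event", e);
    } finally {
      txn.close();
    }
  }

  // Stand-in for whatever client the real service would expose.
  static class AnalysisServiceClient {
    private final String url;

    AnalysisServiceClient(String url) {
      this.url = url;
    }

    void submit(byte[] payload) {
      // A real implementation would send the payload to the service at `url`.
      System.out.println("would send " + payload.length + " bytes to " + url);
    }
  }
}

(Whether the results then flow to the configured storage sink from the service
itself or from a downstream Flume agent is the open design question in this
thread.)
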
>  
> Mike Percy <[EMAIL PROTECTED]> wrote:
> Hi Steven,
> Thanks for chiming in! Please see my responses inline:
>
> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:
>> The only missing link within the Flume architecture I see in this
>> conversation is the actual channels and brokers themselves, which orchestrate
>> this lovely undertaking of data collection.
>
> Can you define what you mean by channels and brokers in this context, since
> "channel" is a synonym for a queueing event buffer in Flume parlance? Also,
> can you elaborate on what you mean by orchestration? I think I know where
> you're going, but I don't want to put words in your mouth.
>
>> One opportunity I do see (and I may be wrong) is for the data to be offloaded
>> into a system such as Apache Mahout before being sent to the sink. Perhaps
>> the concept of a ChannelAdapter of sorts, i.e. a Mahout adapter? Just thinking
>> out loud; it may well be out of the question.
>
> Why not a Mahout sink? Since Mahout often wants sequence files in a particular
> format to begin its MapReduce processing (e.g. its k-means clustering
> implementation), Flume is already a good fit: its HDFS sink and
> EventSerializers let you write a plugin to format your data however it needs
> to go in. In fact, that works today if you have a batch (even 5-minute batch)
> use case. With today's functionality, you could use Oozie to coordinate
> kicking off the Mahout M/R job periodically, as new data becomes available and
> the files are rolled.
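
(For illustration of the EventSerializer point: with the HDFS sink's DataStream
file type, the serializer property names a builder class that formats each
event on its way into the file. The sketch below writes simple tab-separated
text; producing the SequenceFiles that Mahout's jobs usually expect would
instead go through the sink's SequenceFile support. Package and class names
here are made up.)

package org.example.flume;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

// Writes one tab-separated line per event: timestamp header, then body.
public class TabSeparatedSerializer implements EventSerializer {

  private final OutputStream out;

  private TabSeparatedSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    // a header line could be written here
  }

  @Override
  public void afterReopen() throws IOException {
    // nothing to do; we only ever append
  }

  @Override
  public void write(Event event) throws IOException {
    String timestamp = event.getHeaders().get("timestamp");
    if (timestamp != null) {
      out.write(timestamp.getBytes(StandardCharsets.UTF_8));
    }
    out.write('\t');
    out.write(event.getBody());
    out.write('\n');
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void beforeClose() throws IOException {
    // nothing to finalize
  }

  @Override
  public boolean supportsReopen() {
    return true;
  }

  // Builder referenced from the sink configuration.
  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new TabSeparatedSerializer(out);
    }
  }
}

(It would be wired in with something like
agent.sinks.hdfsSink.serializer = org.example.flume.TabSeparatedSerializer$Builder,
where the agent and sink names are placeholders.)
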
>
> Perhaps even more interestingly, I can see a use case where you might want to
> use Mahout to do streaming / realtime updates driven by Flume, in the form of
> an interceptor or a Mahout sink. If online machine learning (e.g. stochastic
> gradient descent or something else online) is what you were thinking of, I
> wonder whether there are any folks on this list who might be interested in
> helping to put such a thing together.
>
> In any case, I'd like to hear more about specific use cases for streaming
> analytics. :)
>
> Regards,
> Mike
>

--
Nitin Pawar