Flume, mail # dev - JIRA Storm/Mahout Sink


JIRA Storm/Mahout Sink
syates@... 2013-07-16, 07:09
Hi Dev, if there is suitable interest I would like to discuss the thread
below further and open up any further opportunities for streaming analytics
within Flume.

One particular use case I am considering right now is the ability to capture
a continuous stream (of varying velocity) of user-generated events from
different sources and overlay this stream with a plugin-style dashboard.
This dashboard will allow a user to create/manage an underlying stream and
apply various "probes" (so to speak).

I understand the Storm project is a great contender at this point in time
for offloading computational tasks for near-real-time feedback. However, I
do not believe that at this stage the Storm project has the intuitive,
dashboard-like pluggable probe system (call it what you like) that I am
interested in.

I would be interested in working with the Flume contributors to develop
such an add-on project.

All thoughts are welcome.

Yours in Flume,
-Steve

Hi Steve,

I can understand the idea of having data processed inside Flume by streaming
it to another Flume agent. But what I am wondering is: do we really need to
re-engineer something inside Flume? The core Flume dev team may have better
ideas on this, but currently Storm is a strong candidate for streaming data
processing.
Flume does have an open JIRA on this integration:
FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>

It will be interesting to draw up the performance comparisons if the data
processing logic is added to Flume. We do currently see people doing a
little bit of pre-processing of their data (they have their own custom
channel types where they modify the data and sink it).
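
That sort of light pre-processing is also commonly done with a custom
Interceptor sitting between the source and the channel. A minimal sketch,
assuming a made-up class name and header key (nothing below ships with
Flume):

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Illustrative interceptor that tags each event before it reaches the
// channel. Real pre-processing logic would replace the header stamp below.
public class EnrichingInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // No setup needed for this sketch.
  }

  @Override
  public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    // Hypothetical enrichment: record when the event passed through.
    headers.put("processedAt", Long.toString(System.currentTimeMillis()));
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() {
    // Nothing to release.
  }

  // Flume instantiates interceptors through a Builder named in the config.
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new EnrichingInterceptor();
    }

    @Override
    public void configure(Context context) {
      // Read interceptor properties from the agent config here if needed.
    }
  }
}

The interceptor is then wired onto a source's interceptor list in the agent
configuration by the Builder's class name.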
On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[EMAIL PROTECTED]> wrote:

> Thanks for your feedback Mike. I have been thinking about this a little
> more and, just using Mahout as an example, I was considering the concept of
> somehow developing an enriched 'sink', so to speak, which would accept
> input streams / messages from a Flume channel and forward them on
> specifically to a 'service' (i.e. a Mahout service) which would
> subsequently deliver the results to the configured sink. So yes, it would
> behave as an intercept->filter->process->sink for applicable data items.
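
In sink terms, that intercept->filter->process->sink idea might look roughly
like the sketch below; AnalyticsServiceClient is a purely hypothetical
stand-in for whatever service (Mahout-backed or otherwise) would do the
processing:

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Sketch of an "enriched" sink: take events from the channel and hand them
// to an external processing service before (or instead of) persisting them.
public class ForwardingSink extends AbstractSink implements Configurable {

  // Hypothetical stand-in for the service that does the real processing.
  interface AnalyticsServiceClient {
    void send(byte[] body);
  }

  private AnalyticsServiceClient client;
  private String endpoint;

  @Override
  public void configure(Context context) {
    // Read the (hypothetical) service endpoint from the agent config.
    endpoint = context.getString("endpoint", "localhost:9090");
    client = body -> {
      // A real implementation would forward the payload to `endpoint` here.
    };
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = channel.take();
      if (event == null) {           // channel is empty, back off for a while
        tx.commit();
        return Status.BACKOFF;
      }
      client.send(event.getBody());  // forward the payload for processing
      tx.commit();
      return Status.READY;
    } catch (Exception e) {
      tx.rollback();
      throw new EventDeliveryException("Failed to forward event", e);
    } finally {
      tx.close();
    }
  }
}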
>
> I apologise if that is still vague. It would be great to receive further
> feedback from the user group.
>
> -Steve
>
> Mike Percy <[EMAIL PROTECTED]> wrote:
> Hi Steven,
> Thanks for chiming in! Please see my responses inline:
>
> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:
>
>> The only missing link within the Flume architecture I see in this
>> conversation is the actual channels and brokers themselves, which
>> orchestrate this lovely undertaking of data collection.
>>
>
> Can you define what you mean by channels and brokers in this context,
> since in Flume parlance a channel is just a synonym for a queueing event
> buffer? Also, can you elaborate more on what you mean by orchestration? I
> think I know where you're going but I don't want to put words in your mouth.
>
>> One opportunity I do see (and I may be wrong) is for the data to be
>> offloaded into a system such as Apache Mahout before being sent to the
>> sink. Perhaps the concept of a ChannelAdapter of sorts? I.e. a Mahout
>> Adapter? Just thinking out loud and it may well be out of the question.
>>
>
> Why not a Mahout sink? Since Mahout often wants sequence files in a
> particular format to begin its MapReduce processing (e.g. its k-Means
> clustering implementation), Flume is already a good fit: its HDFS sink and
> EventSerializers let you write a plugin to format your data however it
> needs to go in. In fact, that works today if you have a batch (even
> 5-minute batch) use case. With today's functionality, you could use Oozie
> to coordinate kicking off the Mahout M/R job periodically, as new data
> becomes available and the files are rolled.
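
A minimal sketch of that EventSerializer route, assuming a made-up class
that simply writes one line per event body so the rolled files can be picked
up by a later batch (e.g. Mahout) job; nothing below ships with Flume:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

// Illustrative serializer: writes each event body as a single line so the
// files rolled by the HDFS sink can be consumed by a downstream batch job.
public class LinePerEventSerializer implements EventSerializer {

  private final OutputStream out;

  private LinePerEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override public void afterCreate() throws IOException { }

  @Override public void afterReopen() throws IOException { }

  @Override
  public void write(Event event) throws IOException {
    out.write(event.getBody());
    out.write('\n');
  }

  @Override public void flush() throws IOException { out.flush(); }

  @Override public void beforeClose() throws IOException { }

  @Override public boolean supportsReopen() { return false; }

  // Flume looks the serializer up through this nested Builder.
  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new LinePerEventSerializer(out);
    }
  }
}

The HDFS sink would then reference the Builder class via its serializer
property in the agent configuration.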
>
> Perhaps even more interestingly, I can see a use case where you might want
Nitin Pawar