Flume >> mail # dev >> JIRA Storm/Mahout Sink


JIRA Storm/Mahout Sink
Hi Dev, if there is sufficient interest I would like to discuss the
thread below in more detail and explore further opportunities for
streaming analytics within Flume.

One particular use case I am considering right now is the ability to
capture a continuous stream (of varying velocity) of user-generated
events from different sources and overlay this stream with a
plugin-style dashboard. This dashboard would allow a user to
create and manage an underlying stream and apply various "probes" (so to
speak).

I understand the Storm project is a great contender at this point in
time for offloading computational tasks for "real-time-like" feedback.
However, I do not believe that at this stage the Storm project has the
intuitive, dashboard-like pluggable probe system (call it what you
like) that I am interested in.

I would be interested in working with the Flume contributors to develop
such an add-on project.

All thoughts are welcome.

Yours in Flume,
-Steve

Hi Steve,

I can understand the idea of having data processed inside Flume by
streaming it to another Flume agent. But do we really need to re-engineer
something inside Flume? That is what I am wondering. The core Flume dev
team may have better ideas on this, but currently Storm is a strong
candidate for streaming data processing.
Flume does have an open JIRA on this integration:
FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>

It will be interesting to compare performance if the data processing
logic is added to Flume. We currently see people doing a little
pre-processing of their data (they have their own custom channel types
where they modify the data and sink it).
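
As a rough illustration of the in-flight pre-processing described above, and of the intercept->filter->process->sink flow proposed in the quoted message below, here is a minimal Python sketch. It does not use the Flume, Storm or Mahout APIs: Event, EnrichedSink, tag_source, has_body and score_event are hypothetical names, and score_event merely stands in for a call to an external analytics service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: string headers plus a body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def tag_source(event: Event) -> Event:
    """Intercept step: annotate the event without touching its body."""
    event.headers.setdefault("source", "unknown")
    return event


def has_body(event: Event) -> bool:
    """Filter step: drop events that carry no payload."""
    return bool(event.body.strip())


def score_event(event: Event) -> str:
    """Process step: placeholder for an external analytics call (assumption)."""
    return str(len(event.body))


class EnrichedSink:
    """Runs intercept -> filter -> process, then hands results downstream."""

    def __init__(self, service: Callable[[Event], str],
                 downstream: Callable[[Event], None]) -> None:
        self.service = service        # e.g. a wrapper around a remote service
        self.downstream = downstream  # the configured final sink

    def process(self, events: Iterable[Event]) -> int:
        """Push events through the pipeline; return how many were delivered."""
        delivered = 0
        for event in events:
            event = tag_source(event)
            if not has_body(event):
                continue
            event.headers["score"] = self.service(event)
            self.downstream(event)
            delivered += 1
        return delivered
```

To try it, construct EnrichedSink(score_event, results.append) with an empty list named results, then call process() on a few Event objects. Events with an empty body are dropped, and the rest arrive in results with a "score" header attached.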
On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[EMAIL PROTECTED]> wrote:

> Thanks for your feedback Mike. I have been thinking about this a little
> more and, just using Mahout as an example, I was considering the concept of
> somehow developing an enriched 'sink', so to speak, which would accept
> input streams / messages from a Flume channel and forward them to a
> 'service', e.g. a Mahout service, which would subsequently deliver the
> results to the configured sink. So yes, it would behave as an
> intercept->filter->process->sink for applicable data items.
>
> I apologise if that is still vague. It would be great to receive further
> feedback from the user group.
>
> -Steve
>
> Mike Percy <[EMAIL PROTECTED]> wrote:
> Hi Steven,
> Thanks for chiming in! Please see my responses inline:
>
> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:
>
>> The only missing link within the Flume architecture I see in this
>> conversation is the actual channels and brokers themselves which
>> orchestrate this lovely undertaking of data collection.
>>
>
> Can you define what you mean by channels and brokers in this context?
> I ask because "channel" is a synonym for a queueing event buffer in Flume
> parlance.
> Also, can you elaborate more on what you mean by orchestration? I think I
> know where you're going but I don't want to put words in your mouth.
>
>> One opportunity I do see (and I may be wrong) is for the data to be
>> offloaded into a system such as Apache Mahout before being sent to the
>> sink. Perhaps the concept of a ChannelAdapter of sorts, i.e. a Mahout
>> adapter? Just thinking out loud, and it may be well out of the question.
>>
>
> Why not a Mahout sink? Since Mahout often wants sequence files in a
> particular format to begin its MapReduce processing (e.g. its k-Means
> clustering implementation), Flume is already a good fit with its HDFS sink
> and EventSerializers allowing for writing a plugin to format your data
> however it needs to go in. In fact that works today if you have a batch
> (even 5-minute batch) use case. With today's functionality, you could use
> Oozie to coordinate kicking off the Mahout M/R job periodically, as new
> data becomes available and the files are rolled.
>
> Perhaps even more interestingly, I can see a use case where you might want
Nitin Pawar
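
The batch approach Mike describes in the quoted message (roll files on an interval, then start a job once a batch is complete) can be illustrated with a small Python sketch. This is not Flume, Oozie or Mahout code: window_start, group_into_batches and closed_batches are hypothetical helpers, and the 300-second interval simply mirrors the 5-minute batch mentioned in the thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ROLL_INTERVAL_SECONDS = 300  # assumption: the 5-minute batch from the thread


def window_start(timestamp: int, interval: int = ROLL_INTERVAL_SECONDS) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % interval)


def group_into_batches(events: Iterable[Tuple[int, str]],
                       interval: int = ROLL_INTERVAL_SECONDS) -> Dict[int, List[str]]:
    """Group (timestamp, body) pairs by window, like files rolled on a timer."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, body in events:
        batches[window_start(timestamp, interval)].append(body)
    return dict(batches)


def closed_batches(batches: Dict[int, List[str]], now: int,
                   interval: int = ROLL_INTERVAL_SECONDS) -> Dict[int, List[str]]:
    """Return only the windows that have fully elapsed and are safe to process."""
    return {start: bodies for start, bodies in batches.items()
            if start + interval <= now}
```

A scheduler would call closed_batches() periodically and launch one downstream job per returned window, leaving the still-open window untouched until it has rolled.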