-
Analysis of Data
Surindhar 2013-02-07, 09:52
Hi,

Does Flume support analysis of data?

Br,
-
Re: Analysis of Data
Inder Pall 2013-02-07, 10:39
Flume is a platform for getting events to the right sink (HDFS, local file, ...). Analytics is not something that falls in its territory.

- Inder

--
- Inder
"You are average of the 5 people you spend the most time with"
-
Re: Analysis of Data
Mike Percy 2013-02-07, 10:59
Let's take this conversation further. What is missing?
-
Re: Analysis of Data
Nitin Pawar 2013-02-07, 11:22
1) Flume is an isolated distributed system, in the sense that one agent has no idea about any other agent.
2) When Flume needs to collect data from multiple references and work across different data sets, an agent may not have the entire data set it needs.
3) Even assuming we have the required data on the agents to process it in batches, do we really want to put pressure on a live production server for data processing that could be done by systems like Storm, Hadoop, or another system?

These are my ideas. I could be totally wrong, but from a systems point of view it looks like a good option to keep data acquisition separate from data processing, and then to store the processed data for further data serving.

-- Nitin Pawar
-
Re: Analysis of Data
Steven Yates 2013-02-07, 23:04
The only missing link within the Flume architecture I see in this conversation is the actual channels and brokers themselves, which orchestrate this lovely undertaking of data collection.

One opportunity I do see (and I may be wrong) is for the data to be offloaded into a system such as Apache Mahout before being sent to the sink. Perhaps the concept of a ChannelAdapter of sorts, i.e. a Mahout adapter? Just thinking out loud, and it may be well out of the question.

Thanks
Steve
-
Re: Analysis of Data
Mike Percy 2013-02-08, 03:00
Hi Steven,
Thanks for chiming in! Please see my responses inline:

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:
> The only missing link within the Flume architecture I see in this
> conversation is the actual channels and brokers themselves which
> orchestrate this lovely undertaking of data collection.

Can you define what you mean by channels and brokers in this context? I ask because "channel" is a synonym for a queueing event buffer in Flume parlance. Also, can you elaborate more on what you mean by orchestration? I think I know where you're going but I don't want to put words in your mouth.

> One opportunity I do see (and I may be wrong) is for the data to be offloaded
> into a system such as Apache Mahout before being sent to the sink. Perhaps
> the concept of a ChannelAdapter of sorts? I.e. Mahout Adapter? Just
> thinking out loud and it may be well out of the question.

Why not a Mahout sink? Since Mahout often wants sequence files in a particular format to begin its MapReduce processing (e.g. its k-Means clustering implementation), Flume is already a good fit: its HDFS sink and EventSerializers allow you to write a plugin that formats your data however it needs to go in. In fact that works today if you have a batch (even 5-minute batch) use case. With today's functionality, you could use Oozie to coordinate kicking off the Mahout M/R job periodically, as new data becomes available and the files are rolled.

Perhaps even more interestingly, I can see a use case where you might want to use Mahout to do streaming / realtime updates driven by Flume in the form of an interceptor or a Mahout sink. If online machine learning (e.g. stochastic gradient descent or something else online) was what you were thinking, I wonder if there are any folks on this list who might have an interest in helping to work on putting such a thing together.

In any case, I'd like to hear more about specific use cases for streaming analytics. :)

Regards,
Mike
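To make the "online machine learning" idea in the message above a little more concrete, here is a minimal sketch of stochastic gradient descent applied one example at a time, which is the shape of update an interceptor or sink could drive as events arrive. It is written in plain Python with no Flume or Mahout calls; the class name `OnlineLinearModel`, the learning rate, and the toy stream are all assumptions made for illustration, not anything proposed in the thread.

```python
from itertools import cycle, islice
from typing import List, Sequence


class OnlineLinearModel:
    """Linear regressor updated one example at a time with plain SGD."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: Sequence[float]) -> float:
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: Sequence[float], y: float) -> float:
        """One SGD step on squared error; returns the error before the step."""
        error = self.predict(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
        return error


# Hypothetical stream: each "event" is already parsed into (features, label),
# generated here from y = 2x + 1 so the learned parameters are easy to check.
samples = [([0.0], 1.0), ([0.5], 2.0), ([1.0], 3.0)]
model = OnlineLinearModel(n_features=1)
for features, label in islice(cycle(samples), 3000):
    model.update(features, label)
```

The point of the sketch is only that the model never needs the whole data set at once: each event triggers a constant-cost update, which is what distinguishes this from the batch path through HDFS and a periodic MapReduce job.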
-
Re: Analysis of Data
Mike Percy 2013-02-08, 02:46
Thanks for replying Nitin. My thoughts inline:

On Thu, Feb 7, 2013 at 3:22 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
> 1) Flume is isolated distributed system in the sense one agent does not
> have an idea about any other agent

Avro sinks know about downstream Avro sources, so basically it's a digraph, right?

> 2) Flume in the sense when needs to collect data from multiple references
> and work across different data sets, it may not have the entire data set
> needed

I see what you are saying, however that is often the case with a streaming data processing system, right?

> 3) let us assume we have required data on agents for processing it in
> batches, do we really want to pressurize a live production server for data
> processing which can be done by systems like storm or hadoop or other
> system?

The data can be sent to downstream hops so there is no need to do data processing on the application tier.

> these are my ideas .. i can be totally wrong but just from systems point of
> view it looks good option to keep data acquisition separate from data
> processing and then storing the processed data for further data serving

In theory I agree with you, but because Flume can pipe data to downstream agents that can do the heavy processing, it seems to me that this requirement is easily fulfilled by Flume.

Regards,
Mike
-
Re: Analysis of Data
Nitin Pawar 2013-02-07, 10:15
It just supports collection of data.
It does not understand anything about the content of your data.

-- Nitin Pawar
-
Re: Analysis of Data
Surindhar 2013-02-07, 10:24
Thanks Nitin,
-
Re: Analysis of Data
Bertrand Dechoux 2013-02-07, 10:30
From my point of view, while you could do it, Flume is not really meant to do it. It's not a CEP engine. But it can be extended. So it really depends on the complexity of your analysis.

Bertrand
-
Re: Analysis of Data
Steve Yates 2013-02-08, 03:22
Thanks for your feedback Mike. I have been thinking about this a little more and, just using Mahout as an example, I was considering the concept of somehow developing an enriched 'sink', so to speak, which would accept input streams / messages from a Flume channel and forward them specifically to a 'service' (i.e. a Mahout service), which would subsequently deliver the results to the configured sink. So yes, it would behave as an intercept->filter->process->sink for applicable data items.

I apologise if that is still vague. It would be great to receive further feedback from the user group.

-Steve
-
Re: Analysis of Data
Nitin Pawar 2013-02-08, 04:55
Hi Steve,
I can understand the idea of having data processed inside Flume by streaming it to another Flume agent. But do we really need to re-engineer something inside Flume, is what I am wondering. The core Flume dev team may have better ideas on this, but currently Storm is a strong candidate for streaming data processing. Flume does have an open JIRA on this integration: FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>

It will be interesting to compare performance if the data processing logic is added to Flume. We do currently see people doing a little bit of pre-processing of their data (they have their own custom channel types where they modify the data and sink it).

-- Nitin Pawar
-
Re: Analysis of Data
Steven Yates 2013-02-08, 10:45
Nitin, +1 on the Storm sink. Worth discussing further IMO.

-Steve
-
Re: Analysis of Data
Mike Percy 2013-02-08, 08:56
Nitin,
Good to hear more of your thoughts. Please see inline.

On Thu, Feb 7, 2013 at 8:55 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
> I can understand the idea of having data processed inside flume by
> streaming it to another flume agent. But do we really need to re-engineer
> something inside flume is what I am thinking? Core flume dev team may have
> better ideas on this but currently for streaming data processing storm is a
> huge candidate.
> flume does have an open jira on this integration
> FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>

Yes, a Storm sink could be useful. But that wouldn't preclude us from taking a hard look at what may be missing in Flume itself, right?

> It will be interesting to draw up the comparisons in performance if the
> data processing logic is added to flume. We do see currently people
> having a little bit of pre-processing of their data (they have their own
> custom channel types where they modify the data and sink it)

It sounds like you have some experience with Flume. Are you guys using it at Rightster?

I work with a lot of folks to set up and deploy Flume, many of whom do lookups / joins with other systems, transformations, etc. in real time along their data ingest pipeline before writing the data to HDFS or HBase for further processing and archival. I wouldn't say these are really heavy number-crunching implementations in Flume, but I certainly see a lot of inline parsing, inspection, enrichment, routing, and the like going on. I think Flume could do a lot more, given the right abstractions.

Regards,
Mike
-
Re: Analysis of Data
Nitin Pawar 2013-02-08, 09:45
Mike, Yes

I am not against the approach of Flume doing it. I would love to see it as part of Flume (it of course helps to remove the overhead of one processing engine). As Flume already supports the grouping of agents, the normal route of acquisition and sink can continue.

In another route, we can have it sink to a processor source in Flume, which then converts the data, runs quick analysis on the data in memory, and updates things like global counters, which can then be sunk to live reporting systems.

Thanks,
Nitin
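The "quick analysis in memory that updates global counters" route described above can be sketched roughly as follows. This is plain Python, not Flume code: `CounterProcessor`, the `(headers, body)` event shape, and the `status` header are hypothetical names chosen for illustration, and the `flush` step stands in for whatever would hand the totals to a live reporting system.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (headers, body), loosely mirroring a header map plus payload.
Event = Tuple[Dict[str, str], bytes]


class CounterProcessor:
    """Keeps running per-key counts in memory and hands out snapshots."""

    def __init__(self, header: str) -> None:
        self.header = header
        self.counts: Counter = Counter()

    def process(self, events: Iterable[Event]) -> None:
        # Count each event under the value of the chosen header.
        for headers, _body in events:
            self.counts[headers.get(self.header, "unknown")] += 1

    def flush(self) -> Dict[str, int]:
        """Return the current counts and reset them, as a timer-driven report might."""
        snapshot = dict(self.counts)
        self.counts.clear()
        return snapshot


processor = CounterProcessor(header="status")
processor.process([({"status": "200"}, b"GET /"), ({"status": "500"}, b"GET /x")])
processor.process([({"status": "200"}, b"GET /y"), ({}, b"malformed")])
report = processor.flush()
```

Because the counters live in one process, this only gives "global" numbers if every relevant event is routed to the same downstream hop, which is the trade-off raised earlier in the thread about an agent not seeing the whole data set.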
-
Re: Analysis of Data
syates@... 2013-02-08, 11:34
Hi Nitin,

Would it be feasible to consider adding another extension point to Flume for custom filtering, enrichment, routing, etc., without trying to turn Flume into something it was never designed for (i.e. without going overboard)? The concept of some sort of intermediate processing unit is quite attractive to me personally: I have dedicated AvroSources purely for aggregating data, but in the interest of modularisation I may want to perform some enrichment/filtering before I dump the events on my durable channel. I guess that leads to a conversation about flow, and about some sort of declarative way of configuring the ordering of the processing units, etc. Just thinking out loud.

@Nitin/Mike, your experience in the field will assist in validating this further.

-Steve
-
Re: Analysis of Data
Mike Percy 2013-02-08, 22:09
Steven,
Any reason you are not using interceptors for that? Can you provide more detail on what you are doing?

See more about Interceptors here: http://flume.apache.org/FlumeUserGuide.html#flume-interceptors

Regards
Mike
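For readers unfamiliar with the interceptor idea referred to above, here is a minimal, language-neutral sketch of an ordered chain that filters and enriches events before they reach a channel. It is plain Python and does not use Flume's API (real interceptors are Java classes implementing `org.apache.flume.interceptor.Interceptor` and are configured on a source); the `Event` class, the two interceptor functions, and the `datacenter` header value are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    """Hypothetical stand-in for an event: string headers plus a byte body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# An interceptor returns the (possibly modified) event, or None to drop it.
Interceptor = Callable[[Event], Optional[Event]]


def drop_debug(event: Event) -> Optional[Event]:
    """Filtering step: discard events whose body starts with b"DEBUG"."""
    return None if event.body.startswith(b"DEBUG") else event


def tag_datacenter(event: Event) -> Optional[Event]:
    """Enrichment step: add a header that a later hop could route on."""
    event.headers["datacenter"] = "dc1"  # hypothetical value
    return event


def run_chain(chain: List[Interceptor], events: List[Event]) -> List[Event]:
    """Apply interceptors in declared order; dropped events go no further."""
    kept = []
    for event in events:
        current: Optional[Event] = event
        for interceptor in chain:
            current = interceptor(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


events = [Event(b"DEBUG cache miss"), Event(b"ERROR disk full")]
to_channel = run_chain([drop_debug, tag_datacenter], events)
```

The ordering of the list passed to `run_chain` plays the role of the "declarative ordering of processing units" asked about earlier in the thread: putting the filter first means dropped events never pay the cost of enrichment.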
-
Re: Analysis of DataSteven Yates 2013-02-10, 09:00
Absolutely Mike thank you.
Specifically though, it would be nice to be able to feed back the results from
an external process (such as Mahout or Storm) into a Flume channel/sink?

-Steve

From: Mike Percy <[EMAIL PROTECTED]>
Reply-To: <[EMAIL PROTECTED]>
Date: Fri, 8 Feb 2013 14:09:04 -0800
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Cc: Nitin Pawar <[EMAIL PROTECTED]>
Subject: Re: Analysis of Data

Steven,
Any reason you are not using interceptors for that? Can you provide more
detail on what you are doing?

See more about Interceptors here:
http://flume.apache.org/FlumeUserGuide.html#flume-interceptors

Regards
Mike
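One way to picture the feedback loop Steve asks about, assuming the external process (Storm, Mahout, or anything else) is able to act as an Avro RPC client, is a separate agent whose Avro source receives the results and hands them to a channel and sink. This is only a sketch: the agent name `fb`, the component names, the port, and the choice of a memory channel and logger sink are assumptions made for illustration, and nothing in the thread confirms that anyone deployed it this way.

```properties
# Hypothetical agent "fb": receives processed results from an external system
fb.sources = results
fb.channels = mem
fb.sinks = out

# Avro source the external process would send its results to (port is an assumption)
fb.sources.results.type = avro
fb.sources.results.bind = 0.0.0.0
fb.sources.results.port = 41414
fb.sources.results.channels = mem

# In-memory buffer; swap for a file channel if the results must survive a restart
fb.channels.mem.type = memory
fb.channels.mem.capacity = 10000

# Placeholder sink; replace with a persistent sink for real use
fb.sinks.out.type = logger
fb.sinks.out.channel = mem
```

With a layout like this the external system stays outside Flume and only needs to speak the Avro source's protocol, which keeps acquisition and processing separate in the way Nitin argues for earlier in the thread.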
-
Re: Analysis of Data
Inder Pall 2013-02-08, 08:48
Another thought - for streaming analytics you'd need a system which scales,
so in retrospect how about something like a STORM SINK which internally can
use FLUME again to write the processed event to a persistent SINK?

- Inder

On Fri, Feb 8, 2013 at 10:25 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:

> Hi Steve,
>
> I can understand the idea of having data processed inside flume by
> streaming it to another flume agent. But do we really need to re-engineer
> something inside flume is what I am thinking? Core flume dev team may have
> better ideas on this but currently for streaming data processing storm is a
> huge candidate.
> flume does have have an open jira on this integration FLUME-1286
> <https://issues.apache.org/jira/browse/FLUME-1286>
>
> It will be interesting to draw up the comparisons in performance if the
> data processing logic is added to to flume. We do see currently people
> having a little bit of pre-processing of their data (they have their own
> custom channel types where they modify the data and sink it)
>
> On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[EMAIL PROTECTED]> wrote:
>
>> Thanks for your feedback Mike, I have been thinking about this a little
>> more and just using Mahout as an example I was considering the concept of
>> somehow developing an enriched 'sink' so to speak where it would accept
>> input streams / msgs from a flume channel and onforward specifically to a
>> 'service' i.e Mahout service which would subsequently deliver the results
>> to the configured sink. So yes it would behave as an
>> intercept->filter->process->sink for applicable data items.
>>
>> I apologise if that is still vague. It would be great to receive further
>> feedback from the user group.
>>
>> -Steve
>>
>> Mike Percy <[EMAIL PROTECTED]> wrote:
>> Hi Steven,
>> Thanks for chiming in! Please see my responses inline:
>>
>> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[EMAIL PROTECTED]> wrote:
>>
>>> The only missing link within the Flume architecture I see in this
>>> conversation is the actual channel's and brokers themselves which
>>> orchestrate this lovely undertaking of data collection.
>>
>> Can you define what you mean by channels and brokers in this context?
>> Since channel is a synonym for queueing event buffer in Flume parlance.
>> Also, can you elaborate more on what you mean by orchestration? I think I
>> know where you're going but I don't want to put words in your mouth.
>>
>>> One opportunity I do see (and I may be wrong) is for the data to
>>> offloaded into a system such as Apache Mahout before being sent to the
>>> sink. Perhaps the concept of a ChannelAdapter of sorts? I.e Mahout
>>> Adapter ? Just thinking out loud and it may be well out of the question.
>>
>> Why not a Mahout sink? Since Mahout often wants sequence files in a
>> particular format to begin its MapReduce processing (e.g. its k-Means
>> clustering implementation), Flume is already a good fit with its HDFS sink
>> and EventSerializers allowing for writing a plugin to format your data
>> however it needs to go in. In fact that works today if you have a batch
>> (even 5-minute batch) use case. With today's functionality, you could use
>> Oozie to coordinate kicking off the Mahout M/R job periodically, as new
>> data becomes available and the files are rolled.
>>
>> Perhaps even more interestingly, I can see a use case where you might
>> want to use Mahout to do streaming / realtime updates driven by Flume in
>> the form of an interceptor or a Mahout sink. If online machine learning
>> (e.g. stochastic gradient descent or something else online) was what you
>> were thinking, I wonder if there are any folks on this list who might have
>> an interest in helping to work on putting such a thing together.
>>
>> In any case, I'd like to hear more about specific use cases for streaming
>> analytics. :)
>>
>> Regards,
>> Mike
>
> --
> Nitin Pawar

--
- Inder
"You are average of the 5 people you spend the most time with"