filter before flush to disk
S Ahmed 2012-05-15, 13:38
Would it be possible to filter the collection before it gets flushed to disk?
Say I am tracking page views per user; I could perform a rollup before the data gets flushed to disk (using a hashmap keyed by sessionId and incrementing a counter for duplicate entries).
And could this be done without modifying the original source, maybe through some sort of event/listener?
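The rollup described here can be sketched in a few lines. This is only an illustration of the hashmap idea, not anything Kafka provides: the function name `roll_up_page_views` and the sample batch are hypothetical, and a Python `Counter` stands in for the hashmap.

```python
from collections import Counter


def roll_up_page_views(session_ids):
    """Collapse a batch of page-view events into one count per session."""
    counts = Counter()
    for session_id in session_ids:
        counts[session_id] += 1  # a duplicate entry just bumps the counter
    return dict(counts)


# Hypothetical batch of page views buffered before a flush.
batch = ["s1", "s2", "s1", "s3", "s1", "s2"]
print(roll_up_page_views(batch))  # {'s1': 3, 's2': 2, 's3': 1}
```

Six buffered events become three entries, which is the space saving the question is after.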
Re: filter before flush to disk
Jay Kreps 2012-05-15, 15:24
Yeah I see where you are going with that. We toyed with this idea, but the idea of coupling processing to the log storage raises a lot of problems for general purpose usage. I think the direction we are going is instead to just let you co-locate this processing on the same box. This gives the isolation of separate processes and the overhead of the transfer over localhost is pretty minor.
-Jay
Re: filter before flush to disk
S Ahmed 2012-05-15, 15:42
What do you mean?
"I think the direction we are going is instead to just let you co-locate this processing on the same box. This gives the isolation of separate processes and the overhead of the transfer over localhost is pretty minor."
I see what you're saying: it is a specific implementation/use case that diverges from a general-purpose mechanism. That's why I was suggesting maybe a hook/event-based system.
Re: filter before flush to disk
S Ahmed 2012-05-15, 15:43
One downside: if my logic were wrong, I would have no window in which to roll it back (having that window was one of the benefits of Kafka's design choice of keeping messages around for x days).
Re: filter before flush to disk
S Ahmed 2012-05-17, 13:40
Oh, maybe this isn't possible anyway, since the object is mapped to a file and data may already have been flushed at the OS level?
Re: filter before flush to disk
Jay Kreps 2012-05-17, 15:02
I think there is no inherent reason we couldn't include a "transformation" plug-in that runs before data is written. But after some bad experiences I am kind of fundamentally against allowing application code into the infrastructure process. Can you flesh out the use case a little more with an example? Wouldn't doing a post-aggregation and re-publication to another topic work just as well?
-Jay
Re: filter before flush to disk
S Ahmed 2012-05-17, 21:32
Say I am storing messages like this:
sessionID, year-month-day-hour-minute-second, value
Now say I only need stats at the minute or hour level. This means that I could save a lot of hard drive space by rolling the data up before it gets persisted to disk.
i.e. I could roll up hundreds of messages per sessionId to a single message.
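As a rough sketch of that rollup, assuming each message is a text line in the `sessionID, year-month-day-hour-minute-second, value` layout above, that `value` is an integer, and that summing is the aggregation wanted (the thread states none of these), the seconds can be truncated and the values combined per bucket. The function name `roll_up_by_minute` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"  # year-month-day-hour-minute-second


def roll_up_by_minute(lines):
    """Sum values per (sessionID, minute) and emit one line per bucket."""
    totals = defaultdict(int)
    for line in lines:
        session_id, stamp, value = (part.strip() for part in line.split(","))
        # Dropping the seconds puts every event of a minute in one bucket.
        minute = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(second=0)
        totals[(session_id, minute)] += int(value)  # assumption: integer values
    return [
        f"{session_id}, {minute.strftime(TIMESTAMP_FORMAT)}, {total}"
        for (session_id, minute), total in sorted(totals.items())
    ]


messages = [
    "abc, 2012-05-17-21-32-05, 1",
    "abc, 2012-05-17-21-32-40, 2",
    "abc, 2012-05-17-21-33-01, 4",
    "xyz, 2012-05-17-21-32-59, 7",
]
for rolled in roll_up_by_minute(messages):
    print(rolled)
```

Using `replace(minute=0, second=0)` instead would give hour-level buckets.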
That's pretty much it, and maybe you're right: it is mixing things, and others might not think it is useful.
Re: filter before flush to disk
Jay Kreps 2012-05-17, 22:34
Yeah, so our current recommendation would be to do that as post-processing in a consumer. It can store its results back to another topic if needed. This gives a clean separation between the log of incoming data and the aggregation process. If you co-locate these things (same machine, different process) the overhead should be pretty small.
-Jay
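The shape of that recommendation can be sketched without any Kafka client code. Everything below is a stand-in: plain Python lists play the role of the raw topic and the rollup topic, `publish` is whatever callable would send a message to the second topic, and `aggregate_and_republish` and `batch_size` are hypothetical names chosen for illustration.

```python
from collections import Counter


def aggregate_and_republish(raw_messages, publish, batch_size=1000):
    """Consume raw session IDs and publish (session_id, count) pairs per batch."""
    counts = Counter()
    seen = 0
    for session_id in raw_messages:
        counts[session_id] += 1
        seen += 1
        if seen == batch_size:
            for item in counts.items():
                publish(item)
            counts.clear()
            seen = 0
    for item in counts.items():  # flush the final partial batch
        publish(item)


raw_topic = ["s1", "s1", "s2", "s1", "s2", "s3"]  # stand-in for the raw log
rollup_topic = []                                  # stand-in for the second topic
aggregate_and_republish(raw_topic, rollup_topic.append, batch_size=4)
print(rollup_topic)
```

The raw log is left untouched, so if the aggregation logic turns out to be wrong it can be corrected and re-run over whatever the retention window still holds.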
Re: filter before flush to disk
S Ahmed 2012-05-29, 13:30
Another issue would be duplicate messages: since Kafka doesn't guarantee that each message is unique, you would have to coordinate somehow between the consumers on whether a message has already been accounted for (which is another point against filtering pre-flush).
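One way to picture the bookkeeping this implies, under the loud assumption that the producer attaches a unique `message_id` to every message (the thread does not say messages carry one, and Kafka itself is not shown supplying it), is an aggregator that remembers which IDs it has already counted. The class name `DedupingCounter` is hypothetical.

```python
from collections import Counter


class DedupingCounter:
    """Counts page views per session, ignoring redelivered messages."""

    def __init__(self):
        self.seen_ids = set()    # assumption: every message has a unique id
        self.counts = Counter()

    def add(self, message_id, session_id):
        """Count the message once; return False if it was a redelivery."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.counts[session_id] += 1
        return True


counter = DedupingCounter()
counter.add("m1", "s1")
counter.add("m1", "s1")  # redelivered, ignored
counter.add("m2", "s1")
print(counter.counts["s1"])  # 2
```

This only covers a single consumer process, and the set of seen IDs grows without bound. Sharing that state across consumers is exactly the coordination problem raised above.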