Re: filter before flush to disk
Another issue would be duplicate messages: since Kafka doesn't guarantee
that each message is unique, you would have to somehow coordinate between
the consumers whether a message has been accounted for or not (which again
makes another point for not filtering pre-flush).
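
For illustration only, here is a minimal sketch of one way a consumer could
skip duplicates before rolling anything up. It assumes the producer stamps a
unique id into each record's key, that raw events sit on a hypothetical
"page-views" topic, and it uses the modern Java consumer client, which
postdates this thread; treat it as a sketch, not a recommended implementation.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class DedupingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dedup-rollup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Hypothetical: ids of messages already accounted for. In practice this
        // would be bounded (e.g. per time window) or shared across consumers.
        Set<String> seenIds = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Record key is assumed to carry a producer-assigned unique message id.
                    if (!seenIds.add(record.key())) {
                        continue; // duplicate: already counted, skip it
                    }
                    // ... hand the record to the rollup/aggregation step ...
                }
            }
        }
    }
}

In practice the dedup state would need to live somewhere shared (or the rollup
itself made idempotent), since several consumers may each see the same
duplicate, which is the coordination problem described above.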

On Thu, May 17, 2012 at 6:34 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> Yeah so our current recommendation would be to do that as post
> processing as a consumer. It can store its results back to another
> topic if needed. This gives a clean separation between the log of
> incoming data and the aggregation process. If you co-locate these
> things (same machine different process) the overhead should be pretty
> small.
>
> -Jay
>
> On Thu, May 17, 2012 at 2:32 PM, S Ahmed <[EMAIL PROTECTED]> wrote:
> > Say I am storing messages like this:
> >
> > sessionID, year-month-day-hour-minute-second, value
> >
> > Now say I only need stats at the minute level, or hour level, this means
> > that I could save a lot of hard drive space by rolling it up before it
> > gets persisted to disk.
> >
> > i.e. I could roll up hundreds of messages per sessionId to a single
> > message.
> >
> > That's pretty much it, and maybe you're right that it is mixing things
> > and others might not think it is useful.
> >
> >
> > On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >
> >> I think there is no inherent reason we couldn't include a
> >> "transformation" plug in that runs before data is written. But after
> >> some bad experiences I am kind of fundamentally against allowing
> >> application code into the infrastructure process. Can you flesh out
> >> the use case a little more with some example? Wouldn't doing a
> >> post-aggregation and re-publication to another topic work just as
> >> well?
> >>
> >> -Jay
> >>
> >> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> > Oh, maybe this isn't possible again since the object is mapped to a
> >> > file, and it may already have flushed data at the os level?
> >> >
> >> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> One downside is if my logic was messed up, I don't have a timeframe
> >> >> of rolling the logic back (which was one of the benefits of kafka's
> >> >> design choice of having messages kept around for x days).
> >> >>
> >> >>
> >> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> What do you mean?
> >> >>>
> >> >>> "  I think the direction we are going
> >> >>> is instead to just let you co-locate this processing on the same
> box.
> >> >>> This gives the isolation of separate processes and the overhead of
> the
> >> >>> transfer over localhost is pretty minor. "
> >> >>>
> >> >>>
> >> >>> I see what you're saying, as it is a specific implementation/use case
> >> >>> that diverges from a general-purpose mechanism; that's why I was
> >> >>> suggesting maybe a hook/event-based system.
> >> >>>
> >> >>>
> >> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>>> Yeah I see where you are going with that. We toyed with this idea,
> >> >>>> but the idea of coupling processing to the log storage raises a lot
> >> >>>> of problems for general purpose usage. I think the direction we are
> >> >>>> going is instead to just let you co-locate this processing on the
> >> >>>> same box. This gives the isolation of separate processes and the
> >> >>>> overhead of the transfer over localhost is pretty minor.
> >> >>>>
> >> >>>> -Jay
> >> >>>>
> >> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >>>> > Would it be possible to filter the collection before it gets
> >> >>>> > flushed to disk?
> >> >>>> >
> >> >>>> > Say I am tracking page views per user, and I could perform a rollup
> >> >>>> > before it gets flushed to disk (using a hashmap with the key being
> >> >>>> > the sessionId,
> >> >
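
For illustration only (the quoted message above is truncated in the archive),
here is a minimal sketch of the pattern Jay recommends: a separate consumer
process reads the raw "sessionID, year-month-day-hour-minute-second, value"
messages, rolls them up in a hashmap keyed by session and minute, and
republishes the aggregates to a second topic, leaving the raw log untouched.
The topic names, the comma-separated message layout, and the use of the modern
Java clients (which postdate this thread) are all assumptions.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class MinuteRollup {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "minute-rollup");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("page-views")); // raw topic (assumed name)
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));

                // Key = "sessionId|yyyy-MM-dd-HH-mm", value = page-view count for that minute.
                Map<String, Long> buckets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumed layout: "sessionId,yyyy-MM-dd-HH-mm-ss,value"
                    String[] parts = record.value().split(",");
                    String sessionId = parts[0];
                    String minute = parts[1].substring(0, 16); // drop the seconds -> minute bucket
                    buckets.merge(sessionId + "|" + minute, 1L, Long::sum);
                }

                // Hundreds of raw messages per session collapse into one record per minute bucket.
                for (Map.Entry<String, Long> entry : buckets.entrySet()) {
                    producer.send(new ProducerRecord<>("page-views-by-minute",
                            entry.getKey(), entry.getValue().toString()));
                }
            }
        }
    }
}

Because the raw topic keeps the full-resolution log for its retention window, a
bug in the rollup logic can still be corrected by re-consuming the original
topic, which addresses the rollback concern raised earlier in the thread.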