Kafka, mail # user - filter before flush to disk


Thread:
  S Ahmed      2012-05-15, 13:38
  Jay Kreps    2012-05-15, 15:24
  S Ahmed      2012-05-15, 15:42
  S Ahmed      2012-05-15, 15:43
  S Ahmed      2012-05-17, 13:40
  Jay Kreps    2012-05-17, 15:02
  S Ahmed      2012-05-17, 21:32
  Jay Kreps    2012-05-17, 22:34
Re: filter before flush to disk
S Ahmed 2012-05-29, 13:30
Another issue would be duplicate messages: since Kafka doesn't guarantee
that each message is unique, you would have to somehow coordinate between
the consumers to decide whether a message has already been accounted for
(which again makes another point for not filtering pre-flush).
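
A minimal sketch of how such consumer-side de-duplication might look,
assuming each message carries an application-level id in its key; the topic
name "page-views" and the use of the newer Java client API (rather than the
0.7 API current at the time of this thread) are assumptions for illustration:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DedupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "rollup-dedup");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // Kept in memory for simplicity; coordinating several consumers would
            // need a shared or persistent store of seen ids.
            Set<String> seenIds = new HashSet<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Kafka does not guarantee uniqueness, so skip ids seen before.
                        if (seenIds.add(record.key())) {
                            process(record.value());
                        }
                    }
                }
            }
        }

        private static void process(String value) {
            // aggregate / count the page view here
        }
    }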

On Thu, May 17, 2012 at 6:34 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> Yeah, so our current recommendation would be to do that as post-
> processing in a consumer. It can store its results back to another
> topic if needed. This gives a clean separation between the log of
> incoming data and the aggregation process. If you co-locate these
> things (same machine, different process) the overhead should be
> pretty small.
>
> -Jay
>
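
A minimal sketch of that "consume, aggregate, re-publish to another topic"
shape, again using the newer Java client for readability; the topic names
"page-views" and "page-views-by-minute", and treating the record key as the
rollup key, are assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RollupRepublisher {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put("bootstrap.servers", "localhost:9092");
            cProps.put("group.id", "minute-rollup");
            cProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            cProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092");
            pProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            pProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.LongSerializer");

            Map<String, Long> counts = new HashMap<>();  // rollup key -> count

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, Long> producer = new KafkaProducer<>(pProps)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    // Read raw events; the raw log stays intact on the broker.
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                        counts.merge(r.key(), 1L, Long::sum);
                    }
                    // Publish the rolled-up counts back to a second topic.
                    for (Map.Entry<String, Long> e : counts.entrySet()) {
                        producer.send(new ProducerRecord<>("page-views-by-minute",
                                e.getKey(), e.getValue()));
                    }
                    counts.clear();
                }
            }
        }
    }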
> On Thu, May 17, 2012 at 2:32 PM, S Ahmed <[EMAIL PROTECTED]> wrote:
> > Say I am storing messages like this:
> >
> > sessionID, year-month-day-hour-minute-second, value
> >
> > Now say I only need stats at the minute level, or the hour level; this
> > means that I could save a lot of hard drive space by rolling it up
> > before it gets persisted to disk.
> >
> > i.e. I could roll up hundreds of messages per sessionId into a single
> > message.
> >
> > That's pretty much it, and maybe you're right that it is mixing things
> > and others might not think it is useful.
> >
> >
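
A minimal sketch of the rollup itself, assuming the literal
"sessionID, year-month-day-hour-minute-second, value" layout described above;
the pipe-delimited map key is only an illustration:

    import java.util.HashMap;
    import java.util.Map;

    public class MinuteRollup {
        // One entry per (sessionID, minute); hundreds of raw messages collapse into it.
        private final Map<String, Long> rollup = new HashMap<>();

        // Expected message layout: "sessionID, yyyy-MM-dd-HH-mm-ss, value"
        public void add(String message) {
            String[] parts = message.split(",\\s*");
            String sessionId = parts[0];
            String minute = parts[1].substring(0, 16);   // drop the seconds: yyyy-MM-dd-HH-mm
            long value = Long.parseLong(parts[2]);
            rollup.merge(sessionId + "|" + minute, value, Long::sum);
        }

        public Map<String, Long> snapshot() {
            return new HashMap<>(rollup);
        }
    }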
> > On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >
> >> I think there is no inherent reason we couldn't include a
> >> "transformation" plug-in that runs before data is written. But after
> >> some bad experiences I am kind of fundamentally against allowing
> >> application code into the infrastructure process. Can you flesh out
> >> the use case a little more with an example? Wouldn't doing a
> >> post-aggregation and re-publication to another topic work just as
> >> well?
> >>
> >> -Jay
> >>
> >> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> > Oh, then again maybe this isn't possible, since the object is mapped
> >> > to a file and it may already have flushed data at the OS level?
> >> >
> >> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> One downside is that if my logic was messed up, I don't have a
> >> >> window for rolling the logic back (which was one of the benefits of
> >> >> Kafka's design choice of keeping messages around for x days).
> >> >>
> >> >>
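
A minimal sketch of what that retention window buys: if the rollup logic
turns out to be wrong, a fresh consumer group can rewind and re-read whatever
raw data is still retained (newer Java client; names are assumptions):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayFromRetention {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "minute-rollup-v2");   // new group id = fresh offsets
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                consumer.poll(Duration.ofSeconds(1));            // join the group, get partitions
                consumer.seekToBeginning(consumer.assignment()); // rewind to oldest retained data
                // ...re-run the corrected rollup over everything still within retention...
            }
        }
    }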
> >> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> What do you mean?
> >> >>>
> >> >>> "I think the direction we are going is instead to just let you
> >> >>> co-locate this processing on the same box. This gives the isolation
> >> >>> of separate processes and the overhead of the transfer over
> >> >>> localhost is pretty minor."
> >> >>>
> >> >>>
> >> >>> I see what you're saying: it is a specific implementation/use case
> >> >>> that diverges from a general-purpose mechanism, which is why I was
> >> >>> suggesting maybe a hook/event-based system.
> >> >>>
> >> >>>
> >> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>>> Yeah, I see where you are going with that. We toyed with this
> >> >>>> idea, but the idea of coupling processing to the log storage
> >> >>>> raises a lot of problems for general-purpose usage. I think the
> >> >>>> direction we are going is instead to just let you co-locate this
> >> >>>> processing on the same box. This gives the isolation of separate
> >> >>>> processes and the overhead of the transfer over localhost is
> >> >>>> pretty minor.
> >> >>>>
> >> >>>> -Jay
> >> >>>>
> >> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >> >>>> > Would it be possible to filter the collection before it gets
> >> >>>> > flushed to disk?
> >> >>>> >
> >> >>>> > Say I am tracking page views per user, and I could perform a
> >> >>>> > rollup before it gets flushed to disk (using a hashmap with the
> >> >>>> > key being the sessionId,