Kafka, mail # user - filter before flush to disk


S Ahmed 2012-05-15, 13:38
Jay Kreps 2012-05-15, 15:24
S Ahmed 2012-05-15, 15:42
S Ahmed 2012-05-15, 15:43
S Ahmed 2012-05-17, 13:40
Jay Kreps 2012-05-17, 15:02
Re: filter before flush to disk
S Ahmed 2012-05-17, 21:32
Say I am storing messages like this:

sessionID, year-month-day-hour-minute-second, value

Now say I only need stats at the minute level or hour level. This means
that I could save a lot of hard drive space by rolling the data up before it
gets persisted to disk.

i.e. I could roll up hundreds of messages per sessionId to a single message.
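
A minimal sketch of the rollup described above, in Python. It assumes each message is a (sessionID, timestamp, value) tuple with the timestamp formatted as year-month-day-hour-minute-second, and that "rolling up" means summing the values in each bucket; the thread does not say how values should be combined, so the sum is an assumption, and the function name `rollup` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed wire format of the timestamp field: year-month-day-hour-minute-second.
INPUT_FORMAT = "%Y-%m-%d-%H-%M-%S"


def rollup(messages, bucket_format="%Y-%m-%d-%H-%M"):
    """Collapse (session_id, timestamp, value) messages into one per bucket.

    bucket_format controls the granularity: the default truncates to the
    minute; "%Y-%m-%d-%H" truncates to the hour.
    Values are summed per (session_id, bucket); this is an assumption.
    """
    totals = defaultdict(int)
    for session_id, timestamp, value in messages:
        parsed = datetime.strptime(timestamp, INPUT_FORMAT)
        bucket = parsed.strftime(bucket_format)
        totals[(session_id, bucket)] += value
    # One output message per (session_id, bucket), in sorted key order.
    return [(sid, bucket, total) for (sid, bucket), total in sorted(totals.items())]
```

With hundreds of messages per session per minute, the output holds a single message per session per minute, which is where the disk savings would come from.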

That's pretty much it, and maybe you're right that it is mixing things and
others might not think it is useful.
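
For comparison, the alternative Jay suggests in the quoted message (post-aggregation and re-publication to another topic) could look roughly like the sketch below. It deliberately avoids any real Kafka client API: `raw_messages` stands in for whatever iterator a consumer of the raw topic provides, and `publish` stands in for a producer sending to a second, rolled-up topic. Both names, and the `batch_size` parameter, are hypothetical. The counting follows the page-view example from the first message in the thread.

```python
from collections import Counter


def aggregate_and_republish(raw_messages, publish, batch_size=1000):
    """Count page views per session and re-publish the counts downstream.

    raw_messages: iterable of session IDs, one per page view (hypothetical
                  stand-in for a consumer reading the raw topic).
    publish:      callable taking a (session_id, count) tuple (hypothetical
                  stand-in for a producer writing to the rollup topic).
    batch_size:   number of raw messages to aggregate before publishing.
    """
    counts = Counter()
    seen = 0
    for session_id in raw_messages:
        counts[session_id] += 1
        seen += 1
        if seen == batch_size:
            for item in counts.items():
                publish(item)
            counts.clear()
            seen = 0
    # Publish whatever is left in a final, partial batch.
    for item in counts.items():
        publish(item)
```

Because this runs as a separate process, the raw topic still holds the original messages for its retention period, so the rollup can be recomputed if the aggregation logic turns out to be wrong.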
On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> I think there is no inherent reason we couldn't include a
> "transformation" plug-in that runs before data is written. But after
> some bad experiences I am kind of fundamentally against allowing
> application code into the infrastructure process. Can you flesh out
> the use case a little more with some examples? Wouldn't doing a
> post-aggregation and re-publication to another topic work just as
> well?
>
> -Jay
>
> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> > Oh, maybe this isn't possible again since the object is mapped to a file,
> > and it may already have flushed data at the os level?
> >
> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >
> >> One downside is that if my logic were messed up, I wouldn't have a window
> >> for rolling it back (which is one of the benefits of Kafka's design
> >> choice of keeping messages around for x days).
> >>
> >>
> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >>
> >>> What do you mean?
> >>>
> >>> "  I think the direction we are going
> >>> is instead to just let you co-locate this processing on the same box.
> >>> This gives the isolation of separate processes and the overhead of the
> >>> transfer over localhost is pretty minor. "
> >>>
> >>>
> >>> I see what you're saying, as it is a specific implementation/use case that
> >>> diverges from a general-purpose mechanism; that's why I was suggesting
> >>> maybe a hook/event-based system.
> >>>
> >>>
> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Yeah I see where you are going with that. We toyed with this idea, but
> >>>> the idea of coupling processing to the log storage raises a lot of
> >>>> problems for general purpose usage. I think the direction we are going
> >>>> is instead to just let you co-locate this processing on the same box.
> >>>> This gives the isolation of separate processes and the overhead of the
> >>>> transfer over localhost is pretty minor.
> >>>>
> >>>> -Jay
> >>>>
> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >>>> > Would it be possible to filter the collection before it gets flushed
> >>>> > to disk?
> >>>> >
> >>>> > Say I am tracking page views per user, and I could perform a rollup
> >>>> > before it gets flushed to disk (using a hashmap with the key being the
> >>>> > sessionId, and incrementing a counter for the duplicate entries).
> >>>> >
> >>>> > And could this be done w/o modifying the original source, maybe
> >>>> > through some sort of event/listener?
> >>>>
> >>>
> >>>
> >>
>
Jay Kreps 2012-05-17, 22:34
S Ahmed 2012-05-29, 13:30