Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Kafka >> mail # user >> filter before flush to disk


+
S Ahmed 2012-05-15, 13:38
+
Jay Kreps 2012-05-15, 15:24
+
S Ahmed 2012-05-15, 15:42
+
S Ahmed 2012-05-15, 15:43
+
S Ahmed 2012-05-17, 13:40
+
Jay Kreps 2012-05-17, 15:02
Copy link to this message
-
Re: filter before flush to disk
Say I am storing messages like this:

sessionID, year-month-day-hour-minute-second, value

Now say I only need to stats at the minute level, or hour level, this means
that i could save allot of hard drive space by rolling it up before it gets
persisted to disk.

i.e. I could roll up hundreds of messages per sessionId to a single message.

That's pretty much it, and maybe your right it is mixing things and others
might not thing it is useful.
On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:

> I think there is no inherent reason we couldn't include a
> "transformation" plug in that runs before data is written. But after
> some bad experiences I am kind of fundamentally against allowing
> application code into the infrastructure process. Can you flesh out
> the use case a little more with some example? Wouldn't doing a
> post-aggregation and re-publication to another topic work just as
> well?
>
> -Jay
>
> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> > Oh, maybe this isn't possible again since the object is mapped to a file,
> > and it may already have flushed data at the os level?
> >
> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >
> >> One downside is if my logic was messed up, I don't have a timeframe of
> >> rolling the logic back (which was one of the benefits of kafka's design
> >> choice of having messages kept around for x days).
> >>
> >>
> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> >>
> >>> What do you mean?
> >>>
> >>> "  I think the direction we are going
> >>> is instead to just let you co-locate this processing on the same box.
> >>> This gives the isolation of separate processes and the overhead of the
> >>> transfer over localhost is pretty minor. "
> >>>
> >>>
> >>> I see what your saying as it is a specific implemention/use case that
> >>> diverts from a general purpose mechanism, that's why I was suggesting
> maybe
> >>> a hook/event based system.
> >>>
> >>>
> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <[EMAIL PROTECTED]>
> wrote:
> >>>
> >>>> Yeah I see where you are going with that. We toyed with this idea, but
> >>>> the idea of coupling processing to the log storage raises a lot of
> >>>> problems for general purpose usage. I think the direction we are going
> >>>> is instead to just let you co-locate this processing on the same box.
> >>>> This gives the isolation of separate processes and the overhead of the
> >>>> transfer over localhost is pretty minor.
> >>>>
> >>>> -Jay
> >>>>
> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <[EMAIL PROTECTED]>
> wrote:
> >>>> > Would it be possible to filter the collection before it gets flush
> to
> >>>> disk?
> >>>> >
> >>>> > Say I am tracking page views per user, and I could perform a rollup
> >>>> before
> >>>> > it gets flushed to disk (using a hashmap with the key being the
> >>>> sessionId,
> >>>> > and increment a counter for the duplicate entries).
> >>>> >
> >>>> > And could this be done w/o modifying the original source, maybe
> through
> >>>> > some sort of event/listener?
> >>>>
> >>>
> >>>
> >>
>
+
Jay Kreps 2012-05-17, 22:34
+
S Ahmed 2012-05-29, 13:30
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB