

Re: filter before flush to disk
Oh, then again, maybe this isn't possible, since the object is mapped to a
file and the data may already have been flushed at the OS level?

On Tue, May 15, 2012 at 11:43 AM, S Ahmed <[EMAIL PROTECTED]> wrote:

> One downside is that if my logic were messed up, I wouldn't have a window
> for rolling it back (which was one of the benefits of Kafka's design
> choice of keeping messages around for x days).
>
>
> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
>
>> What do you mean?
>>
>> "  I think the direction we are going
>> is instead to just let you co-locate this processing on the same box.
>> This gives the isolation of separate processes and the overhead of the
>> transfer over localhost is pretty minor. "
>>
>>
>> I see what you're saying: it is a specific implementation/use case that
>> diverges from a general-purpose mechanism; that's why I was suggesting
>> maybe a hook/event-based system.
>>
>>
>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
>>
>>> Yeah, I see where you are going with that. We toyed with this idea, but
>>> the idea of coupling processing to the log storage raises a lot of
>>> problems for general-purpose usage. I think the direction we are going
>>> is instead to just let you co-locate this processing on the same box.
>>> This gives the isolation of separate processes and the overhead of the
>>> transfer over localhost is pretty minor.
>>>
>>> -Jay
>>>
>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
>>> > Would it be possible to filter the collection before it gets flushed
>>> > to disk?
>>> >
>>> > Say I am tracking page views per user, and I could perform a rollup
>>> > before it gets flushed to disk (using a hashmap with the key being the
>>> > sessionId, and incrementing a counter for the duplicate entries).
>>> >
>>> > And could this be done w/o modifying the original source, maybe through
>>> > some sort of event/listener?
>>>
>>
>>
>
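
For reference, a minimal sketch of the rollup Ahmed describes above: collapse
duplicate page-view events per sessionId in a hashmap before writing them
anywhere. The PageView class and the rollup method are hypothetical names for
the example, not part of Kafka.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the rollup described in the thread (hypothetical names,
    // not a Kafka API): collapse duplicate page-view events per sessionId.
    public class PageViewRollup {

        // One page-view event; only the session id matters for the rollup.
        public static class PageView {
            final String sessionId;
            public PageView(String sessionId) { this.sessionId = sessionId; }
        }

        // Reduce a batch of events to one count per session.
        public static Map<String, Long> rollup(List<PageView> batch) {
            Map<String, Long> countsBySession = new HashMap<>();
            for (PageView pv : batch) {
                countsBySession.merge(pv.sessionId, 1L, Long::sum);
            }
            return countsBySession;
        }
    }

A batch of N raw events with only k distinct sessions comes out as k counter
entries, which is the compaction being asked for.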
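
And a sketch of the co-location Jay suggests: run the rollup as its own
consumer process on the same box, reading over localhost, and write the
aggregated counts wherever they need to go. This uses today's Java consumer
client rather than the 0.7-era API that was current when this thread was
written; the topic name "page-views" and the use of the sessionId as the
message key are assumptions made for the example.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CoLocatedRollup {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Broker and consumer run on the same box, so the transfer stays on localhost.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "pageview-rollup");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views")); // assumed topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    Map<String, Long> counts = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        counts.merge(r.key(), 1L, Long::sum); // key assumed to be the sessionId
                    }
                    // Hand the rolled-up counts to whatever sink needs them.
                    counts.forEach((session, n) -> System.out.println(session + " -> " + n));
                }
            }
        }
    }

The broker's log stays untouched, so the x-day retention, and with it the
ability to re-run corrected logic over old messages, is preserved; nothing is
rolled up before it hits disk.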