Kafka >> mail # user >> kafka for user click tracking, how would this work?


S Ahmed 2012-05-02, 16:05
Neha Narkhede 2012-05-02, 16:31
Re: kafka for user click tracking, how would this work?
Neha,

Why does this repartition occur?  Does this happen when a particular topic
reaches a certain size or number of messages, and it then re-balances?

If I don't care about re-partitioning, I can just write my consuming code
such that IF the userId is the same, I aggregate on that key, and if it's a
new key, I create a new entry in the dictionary (assuming I use a
dictionary, where the key is the userId and the value is the aggregation of
the messages).

I was just aiming to be more efficient than just reading random messages.
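The dictionary approach described above can be sketched as a minimal standalone class. All names here are illustrative, not from any Kafka API: the consumer keeps a map keyed by userId and either creates a new entry or aggregates on an existing key.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the per-userId aggregation described above:
// keep a dictionary keyed by userId and bump the count for each
// message read. Names are illustrative.
public class ClickAggregator {
    private final Map<String, Long> countsByUserId = new HashMap<String, Long>();

    public void onMessage(String userId) {
        Long prev = countsByUserId.get(userId);
        // New key: create an entry; existing key: aggregate on it.
        countsByUserId.put(userId, prev == null ? 1L : prev + 1L);
    }

    public long countFor(String userId) {
        Long c = countsByUserId.get(userId);
        return c == null ? 0L : c;
    }
}
```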

On Wed, May 2, 2012 at 12:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:

> Ahmed,
>
> Your use case sounds similar to what Peter mentioned in another thread -
>
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
>
> On the producer side, you can use a Partitioner to partition the Kafka
> messages by userid. This would ensure that data from a particular user
> always ends up in the same partition[1].
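The per-userid partitioning suggested above comes down to mapping a userid deterministically onto one of N partitions. A standalone sketch of that selection logic follows; the class and method names are assumptions, not the actual Kafka 0.7 Partitioner interface, but the hash-mod-N idea is the same.

```java
// Standalone sketch of hash-based partition selection: the same
// userId always maps to the same partition index. A real Kafka 0.7
// Partitioner would implement the client library's partitioner
// interface instead; this only shows the selection logic.
public class UserIdPartitioner {
    public int partition(String userId, int numPartitions) {
        // Mask the sign bit so the index stays non-negative even when
        // hashCode() is Integer.MIN_VALUE (Math.abs would fail there).
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```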
>
> On the consumer side, you can imagine the application that does the user
> click counting to be one Kafka consumer group. In steady state, one
> partition will always be consumed by only one of the consumers in this
> group. So you could maintain some cache to hold user click counts. However,
> when rebalancing happens, the partition could be consumed by another
> consumer. So, right before the rebalancing operation, you would want to
> flush your userid counts, so it can be picked up by the next consumer that
> would consume data from that user's partition.
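The flush-before-rebalance idea above can be sketched like this; the `sink` map stands in for whatever durable store holds the rolled-up counts (MySQL in this thread's use case), and all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of flushing the in-memory count cache before a rebalance so
// the next consumer to own the partition picks up from persisted
// state. The "sink" map stands in for a durable store such as MySQL.
public class FlushingClickCounter {
    private final Map<String, Long> cache = new HashMap<String, Long>();
    private final Map<String, Long> sink = new HashMap<String, Long>();

    public void onClick(String userId) {
        Long prev = cache.get(userId);
        cache.put(userId, prev == null ? 1L : prev + 1L);
    }

    // Invoke right before the rebalance hands the partition to
    // another consumer in the group.
    public void flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            Long prev = sink.get(e.getKey());
            sink.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
        }
        cache.clear();
    }

    public long persistedCount(String userId) {
        Long c = sink.get(userId);
        return c == null ? 0L : c;
    }
}
```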
>
> Thanks,
> Neha
>
> 1. Note that the producer-side sticky partitioning guarantees are not
> ideal in Kafka 0.7. This is because when brokers are bounced, partitions
> can become unavailable for some time. During this time, the user's data can
> be routed to another partition. However, with Kafka 0.8, we are working to
> add intra-cluster replication that would guarantee the availability of a
> partition even in the presence of broker failures.
>
>
> On Wed, May 2, 2012 at 9:05 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
>
> > Trying to understand how Kafka could be used in the following scenario:
> >
> > Say I am creating a SaaS application for website click tracking.  So a
> > client would paste some JavaScript on their website, and any link
> > clicked on their website would result in an API call that would log the
> > click (IP address, link metadata, timestamp, session GUID, etc.).
> >
> > Since these API calls are coming from remote servers, I'm guessing I
> > would be wrapping the calls to Kafka via an HTTP server, e.g. a Jetty
> > servlet handler would take the HTTP call made via the API and then
> > write to a Kafka topic.
> >
> > Am I right so far?
> >
> > Now how could I partition the data in a way that would make consuming
> > more efficient?
> > I.e. I am tracking click counts for visitors to a website; it is
> > probable that a user will have multiple messages written to Kafka in a
> > given session, so on the consumer end, if I could read in batches and
> > aggregate before I write the 'rolled up' data to MySQL, that would be
> > ideal.
> >
> > I read the kafka design page, and I understand at a high level that
> > consumers can be 'grouped'.
> >
> > Looking for someone to clarify how this use case could be solved with
> > Kafka, particularly how partitioning and consumption works (still not
> > 100% clear on those, and hopefully this sample use case will clear that
> > up).
> >
>
Neha Narkhede 2012-05-02, 19:13