Kafka, mail # user - kafka for user click tracking, how would this work?


S Ahmed 2012-05-02, 16:05
Neha Narkhede 2012-05-02, 16:31

Re: kafka for user click tracking, how would this work?
S Ahmed 2012-05-02, 17:55
Neha,

Why does this repartition occur?  Is it that when a particular topic reaches a
certain size or number of messages, it re-balances?

If I don't care about re-partitioning, I can just write my consuming code
such that if the userid is the same, I aggregate on that key; if it's a new
key, I create a new entry in the dictionary (assuming I use a dictionary
where the key is the userId and the value is the aggregation of the
messages).

I was just aiming to be more efficient than just reading random messages.
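
A rough sketch of that dictionary-based aggregation (plain Python; the
message shape and field names here are made up for illustration, not an
actual Kafka consumer API):

```python
from collections import defaultdict

def aggregate_clicks(messages):
    """Roll up a batch of click messages by userId before writing
    the aggregated counts to MySQL."""
    counts = defaultdict(int)  # key: userId, value: aggregated click count
    for user_id, _click_payload in messages:
        counts[user_id] += 1   # same key -> aggregate; new key -> new entry
    return dict(counts)

# Hypothetical batch of (userId, click payload) messages from one partition.
batch = [("u1", "link-a"), ("u2", "link-b"), ("u1", "link-c")]
```

The idea being that each batch read from a partition gets rolled up in
memory, and only the rolled-up counts hit the database.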

On Wed, May 2, 2012 at 12:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:

> Ahmed,
>
> Your use case sounds similar to what Peter mentioned in another thread -
>
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
>
> On the producer side, you can use a Partitioner to partition the Kafka
> messages by userid. This would ensure that data from a particular user
> always ends up in the same partition[1].
>
> On the consumer side, you can imagine the application that does the user
> click counting to be one Kafka consumer group. In steady state, one
> partition will always be consumed by only one of the consumers in this
> group. So you could maintain some cache to hold user click counts. However,
> when rebalancing happens, the partition could be consumed by another
> consumer. So, right before the rebalancing operation, you would want to
> flush your userid counts, so it can be picked up by the next consumer that
> would consume data from that user's partition.
>
> Thanks,
> Neha
>
> 1. Note that the producer-side sticky partitioning guarantees are not
> ideal in Kafka 0.7. This is because when brokers are bounced, partitions
> can become unavailable for some time. During this time, the user's data can
> be routed to another partition. However, with Kafka 0.8, we are working to
> add intra-cluster replication that would guarantee the availability of a
> partition even in the presence of broker failures.
>
>
> On Wed, May 2, 2012 at 9:05 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
>
> > Trying to understand how Kafka could be used in the following scenario:
> >
> > Say I am creating a SaaS application for website click tracking.  So a
> > client would paste some javascript on their website, and any link
> > clicked on their website would result in an API call that would log
> > the click (ip address, link metadata, timestamp, session guid, etc).
> >
> > Since these API calls are coming from remote servers, I'm guessing I
> > would be wrapping the calls to Kafka via an HTTP server, e.g. a jetty
> > servlet handler would take the HTTP call made via the API and then
> > write to a Kafka topic.
> >
> > Am I right so far?
> >
> > Now how could I partition the data in a way that would make consuming
> > more efficient?
> > i.e. I am tracking click counts for visitors to a website, so it is
> > probable that a user will have multiple messages written to Kafka in a
> > given session.  On the consumer end, if I could read in batches and
> > aggregate before I write the 'rolled up' data to mysql, that would be
> > ideal.
> >
> > I read the Kafka design page, and I understand at a high level that
> > consumers can be 'grouped'.
> >
> > Looking for someone to clarify how this use case could be solved with
> > Kafka, particularly how partitioning and consumption works (still not
> > 100% clear on those and hopefully this sample use case will clear
> > that up).
> >
>
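
For reference, the producer-side partitioning Neha describes boils down to
a deterministic hash of the userid. A minimal sketch of the idea (the
function name is made up; this is not the actual Kafka 0.7 Partitioner
API, just the mapping it would compute):

```python
import zlib

def partition_for(user_id, num_partitions):
    """Deterministically map a userid to a partition number, so one
    user's clicks always land in the same partition (while that
    partition stays available)."""
    # crc32 is stable across processes, unlike Python's built-in hash(),
    # so the same user keeps mapping to the same partition.
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions
```

Whichever consumer in the group currently owns that partition then sees
all of that user's clicks, which is what makes the per-user in-memory
aggregation safe between rebalances.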
Neha Narkhede 2012-05-02, 19:13