Architecture Consulting
Guy Doulberg 2012-06-19, 13:12
Hi all,
We'd like to consult with you about our Kafka architecture.
We have HTTP endpoints that receive events from the web and push them into the system via Kafka. The events are distinguishable by their HTTP URL, and are sharded to their corresponding topics.
We have 2 designs in mind:
1. One main 'raw' topic, split into multiple enriched topics.
   The endpoints write to one Kafka topic, let's call it the 'raw topic'.
   The 'raw topic' is consumed by a Kafka consumer which does the following:
   i - enriches the data (extracts IP-to-location info, standardizes browser/OS type, etc.)
   ii - feeds the enriched data to a new topic, based on the referrer information.

2. Multiple 'raw' topics, each fed to its corresponding 'enriched' topic.
   The web endpoints shard the events based on their referrer, creating multiple 'raw' topics, one per referrer type/domain.
   Each 'raw' topic is then consumed, and a corresponding enriched stream/topic is created from it.
The dilemma is whether to separate the events into topics as soon as we can, at the web endpoint (option 2), or to postpone it as much as possible (option 1).
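To make the two designs concrete, here is a minimal in-memory sketch. It does not use a Kafka client: topics are plain Python lists, and the `enrich` function, the `referrer` and `user_agent` fields, and the `raw.` / `enriched.` topic names are all assumptions made for illustration.

```python
from collections import defaultdict


def enrich(event):
    """Stand-in for the enrichment step (hypothetical: only normalizes the browser)."""
    enriched = dict(event)
    enriched["browser"] = event.get("user_agent", "unknown").split("/")[0].lower()
    return enriched


def option_1(events):
    """Endpoints write to one raw topic; a consumer enriches and splits by referrer."""
    topics = defaultdict(list)
    topics["raw"].extend(events)  # endpoint side: a single raw topic
    for event in topics["raw"]:   # enrichment consumer: split as late as possible
        topics["enriched." + event["referrer"]].append(enrich(event))
    return topics


def option_2(events):
    """Endpoints shard by referrer; each raw topic feeds its own enriched topic."""
    topics = defaultdict(list)
    for event in events:          # endpoint side: split as early as possible
        topics["raw." + event["referrer"]].append(event)
    for raw_name in [name for name in topics if name.startswith("raw.")]:
        enriched_name = "enriched." + raw_name[len("raw."):]
        topics[enriched_name].extend(enrich(e) for e in topics[raw_name])
    return topics
```

Both functions end up with the same enriched topics; the difference is only where the split happens, and therefore whether all event types share one queue on the way to the enrichment step.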
I prefer option 1, but tests I ran reveal that when many event types share the same topic and some event types occur far more often than others, the more frequent event types seem to "drown" the less common ones: less common events may reach their consumer much later than the more frequent ones. If my system requires 'timely' processing of events, this behaviour poses a problem.
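One way to picture the "drowning" effect is a toy queueing model. This is a sketch, not a measurement of Kafka: it assumes a single consumer that reads one topic strictly in order, and the event counts, arrival times, and per-event service time are invented numbers.

```python
def fifo_delays(arrivals, service_ms):
    """Return (event_type, delay_ms) for each event handled by one in-order consumer.

    arrivals: list of (arrival_ms, event_type), sorted by arrival time.
    """
    delays = []
    free_at = 0  # time at which the consumer finishes its current event
    for arrival_ms, event_type in arrivals:
        start = max(free_at, arrival_ms)
        free_at = start + service_ms
        delays.append((event_type, free_at - arrival_ms))
    return delays


def worst_delay(delays, event_type):
    """Largest delay observed for one event type."""
    return max(delay for kind, delay in delays if kind == event_type)


# Assumed workload: a burst of 1000 frequent events at t=0, one rare event at t=1000 ms.
ARRIVALS = [(0, "frequent")] * 1000 + [(1000, "rare")]
SERVICE_MS = 10

# Shared topic: the rare event queues behind the whole burst.
shared_delay = worst_delay(fifo_delays(ARRIVALS, SERVICE_MS), "rare")  # 9010 ms

# Dedicated topic: the rare event has its own queue and consumer.
rare_only = [a for a in ARRIVALS if a[1] == "rare"]
dedicated_delay = worst_delay(fifo_delays(rare_only, SERVICE_MS), "rare")  # 10 ms
```

In this model the rare event is not slower to process; it simply waits for every frequent event that arrived before it in the same queue.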
What do you think? thanks
Re: Architecture Consulting
Tim Lossen 2012-06-19, 15:53
well, we decided to go with one topic per game (approach 2), as there are some consumers which are only interested in data from a single topic. makes it a bit harder for consumers interested in processing ALL events though.

not knowing more about your concrete situation, it is difficult to decide what is better in your case.

cheers
tim

-- http://tim.lossen.de
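For a consumer that does want ALL events under a topic-per-game layout, one option is to read every topic and merge the streams. The sketch below is only an illustration: it assumes each event carries a timestamp and that each per-topic list is already sorted by it, and the topic names are hypothetical. Kafka itself only guarantees ordering within a partition, so any cross-topic ordering has to come from the events themselves.

```python
import heapq


def merge_topics(topics):
    """Yield (timestamp, topic, payload) across all topics in timestamp order.

    topics: dict mapping topic name -> list of (timestamp, payload),
    each list already sorted by timestamp.
    """
    streams = [
        [(ts, name, payload) for ts, payload in events]
        for name, events in topics.items()
    ]
    # heapq.merge lazily merges already-sorted inputs.
    return heapq.merge(*streams, key=lambda item: item[0])
```

Usage: `list(merge_topics({"game_a": [(1, "x")], "game_b": [(2, "z")]}))` interleaves the two topics by timestamp.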
RE: Architecture Consulting
Guy Doulberg 2012-06-19, 20:09
Hi Tim,
Thanks for your reply.

Does your implementation have an enrichment process? If so, how do you perform the enrichment on each of the topics?

Thanks,
Guy
Re: RE: Architecture Consulting
Tim Lossen 2012-06-19, 22:22
no, we do not preprocess and republish the events, although we have toyed with the idea. currently, all our consumers do their own preprocessing (ip lookup etc.).

cheers
tim

-- http://tim.lossen.de
Re: Architecture Consulting
Jun Rao 2012-06-19, 16:14
Guy,
In approach 1, one thing that you can try is to speed up the consumer by increasing the degree of parallelism (i.e., more partitions per topic).
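A rough back-of-the-envelope sketch of why more partitions help: within a consumer group a partition is read by at most one consumer, so the partition count caps how many consumers can work in parallel. The function below assumes events are spread evenly across partitions and that each partition has its own consumer; the event count and service time are invented numbers.

```python
def drain_time_ms(num_events, num_partitions, service_ms):
    """Time for one consumer per partition to drain an evenly spread backlog."""
    # The fullest partition holds ceil(num_events / num_partitions) events.
    events_in_fullest = -(-num_events // num_partitions)
    return events_in_fullest * service_ms


one_partition = drain_time_ms(1000, 1, 10)    # 10000 ms
ten_partitions = drain_time_ms(1000, 10, 10)  # 1000 ms
```

Note that this shortens the backlog every event waits behind, but rare and frequent events still share the same partitions, so it reduces the delay rather than isolating the rare event types.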
Thanks,
Jun
Re: Architecture Consulting
Tarun Kumar 2012-06-20, 02:51
Hi,
Approach 2 seems better to me. With more topics (and multiple partitions for those topics), you will get better parallelism. This approach will provide better granularity and better control at the message level.
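As an illustration of combining both levels, the sketch below picks a topic from the referrer and a partition from a key. Everything here is hypothetical: the `raw.` topic naming, the `session_id` key, and the CRC32 hash are stand-ins chosen for the example, not the partitioner a Kafka producer actually uses.

```python
import zlib


def route(event, partitions_per_topic):
    """Return the (topic, partition) an event would be written to.

    partitions_per_topic: dict mapping topic name -> number of partitions;
    topics not listed are assumed to have a single partition.
    """
    topic = "raw." + event["referrer"]
    num_partitions = partitions_per_topic.get(topic, 1)
    # A stable hash keeps all events with the same key in the same partition.
    partition = zlib.crc32(event["session_id"].encode("utf-8")) % num_partitions
    return topic, partition
```

With this layout a busy referrer can be given more partitions (and so more parallel consumers) without affecting the topics of quieter referrers.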
But again, this whole thing depends on your use case.

Thanks.