-
Kafka versus classic central HTTP(s) services for logs transmission
Jean Bic 2012-10-20, 20:27
Hello,

We have started to build a solution to gather logs from many machines located in various “sites” onto a so-called “Consolidation server”, whose role is to persist the logs and generate alerts based on some criteria (patterns in logs, triggers on some values, etc.).

Our future users are challenging us to explain why Kafka is the best possible communication solution for this need. They argue that it would be better to choose a more classic HTTP(S)-based solution, with producers calling REST services on a pool of Node.js servers behind a load balancer.

One of the main issues they see with Kafka is that it requires connections from the Consolidation server to the Kafka brokers and Zookeeper daemons located in each “site”, versus connections from the log producers in all sites to the Consolidation servers.
Here Kafka is seen as a burden for each site’s IT team, requiring some special firewall setup, versus no firewall setup with the service-based solution:

1. Kafka requires each site's IT team to create firewall rules accepting incoming connections for a “non standard” protocol from the “Collector server” site.

2. The IT team must expose all Zookeeper and broker machines/ports to the “Collector server” site.

3. Kafka has no built-in encryption for data, whereas a classic service-oriented solution can rely on HTTPS (reverse) proxies.

4. Kafka is not commonly known by IT people, who do not know how to scale it: when should they add broker machines versus when should they add Zookeeper machines?

With the service-based solution, the IT teams of each site are free of scalability issues; only on the “Consolidation server” site does one have to add Node.js machines to scale up.

I agree that these IT concerns can't be taken lightly.

I need help from the Kafka community to find rock-solid arguments for using Kafka over a classic service-based solution.

How would you “defend” Kafka against the above “attacks”?

Regards,

Jean
-
Re: Kafka versus classic central HTTP(s) services for logs transmission
Joe Stein 2012-10-20, 23:25
You could move the producer code to the "site" and expose that as a REST interface.
You can then benefit from the scale and consumer functionality that comes with Kafka without these issues you are bringing up.
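
One way to picture the suggestion above is a small HTTP endpoint running inside the site that accepts log items and hands them to whatever producer code runs locally. The sketch below is only an illustration under stated assumptions: `make_handler` and the `publish` callable are hypothetical names, the JSON-array request body is an assumed format, and `publish` stands in for a real Kafka producer call, which is not shown.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(publish):
    """Build a request handler that forwards each posted item to `publish`.

    `publish` is a placeholder for the site-local producer call (assumption);
    any callable taking one item works.
    """

    class LogHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                items = json.loads(body)
            except ValueError:
                items = None
            # Assumed body format: a JSON array of log items.
            if not isinstance(items, list):
                self.send_response(400)
                self.end_headers()
                return
            for item in items:
                publish(item)
            # 204: accepted, nothing to return to the caller.
            self.send_response(204)
            self.end_headers()

        def log_message(self, format, *args):
            # Keep the sketch quiet; a real service would log requests.
            pass

    return LogHandler


def serve(publish, host="127.0.0.1", port=8080):
    """Run the endpoint until interrupted (host and port are arbitrary)."""
    HTTPServer((host, port), make_handler(publish)).serve_forever()
```

With this shape, the machines in a site only ever speak HTTP to a local address, and only the component behind `publish` needs to know about brokers.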
-
Re: Kafka versus classic central HTTP(s) services for logs transmission
Jean Bic 2012-10-21, 08:44
Joe:

Thanks for your answer, but we're trying to push a Kafka broker at each site...

... so your answer makes me realize why we're trying to push Kafka over per-producer service calls: those would make a very large number of service calls from each site (our log producers gather data every 5 minutes, on average 100 items of about 128 bytes per machine, and we're targeting from 250 to 4000 machines per "site").

I think that, with these numbers, we have a way to make IT people understand that the Kafka solution will avoid flooding the site's firewall infrastructure (which is active for outbound connections).
Beyond this good point for Kafka in terms of number of concurrent connections, I am wondering if we could find other arguments for the Kafka solution...

Jean
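
To make the figures quoted in this message concrete (one collection every 5 minutes, about 100 items of about 128 bytes per machine, 250 to 4000 machines per site), here is a rough back-of-the-envelope calculation. The function name `site_load` and the two "calls per second" scenarios are assumptions added for illustration; the thread does not say whether a producer would issue one service call per item or one per collection.

```python
INTERVAL_S = 5 * 60        # producers gather data every 5 minutes
ITEMS_PER_MACHINE = 100    # average items per machine per interval
ITEM_BYTES = 128           # approximate size of one item


def site_load(machines):
    """Estimate per-site load for a given number of machines."""
    items = machines * ITEMS_PER_MACHINE
    payload = items * ITEM_BYTES
    return {
        "items_per_interval": items,
        "payload_bytes_per_interval": payload,
        "bytes_per_second": payload / INTERVAL_S,
        # Assumed scenario A: one service call per item.
        "calls_per_second_per_item": items / INTERVAL_S,
        # Assumed scenario B: one service call per machine per interval.
        "calls_per_second_per_machine": machines / INTERVAL_S,
    }


if __name__ == "__main__":
    for machines in (250, 4000):
        print(machines, site_load(machines))
```

For 4000 machines this gives 400,000 items and 51,200,000 bytes of payload per interval. The raw byte volume is modest; it is the number of separate calls crossing the firewall that differs most between the two scenarios.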
-
Re: Kafka versus classic central HTTP(s) services for logs transmission
Jun Rao 2012-10-21, 21:44
Jean,

I understand your IT guys' concerns. It's true that Kafka is relatively new and is not as widely adopted as some other conventional solutions. The following are what I see as the main benefits of Kafka:

a. Scalability: The system is designed to scale out.
b. Throughput: Kafka supports a batch API and compression, which increase the throughput of both producers and consumers.
c. Integration for both offline and nearline consumption: With Kafka, you can use a single system to load data into an offline system such as Hadoop as well as to consume the data in real time.
d. Durability and availability: In the upcoming 0.8 release, Kafka will support intra-cluster replication, which provides both higher durability and availability at low cost.

For your concern #2, in 0.8, the producer doesn't need Zookeeper any more. Instead, it relies on an RPC to get topic metadata from the brokers.

We haven't looked into security-related features. However, if this is a common requirement, we can add them in the future.

Hope this is helpful.

Thanks,

Jun
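
Point (b) above, batching combined with compression, can be illustrated without Kafka at all. The sketch below uses Python's standard `zlib` module on synthetic log items to compare compressing 100 items one by one against compressing them as a single batch. The log line content, `make_items`, and `compressed_sizes` are made up for the example, and Kafka's own codecs and wire format are not reproduced here; the sketch only shows why batching before compression tends to help with small, similar records.

```python
import zlib


def make_items(count=100, size=128):
    """Build synthetic log items padded to a fixed size (made-up content)."""
    items = []
    for i in range(count):
        text = "2012-10-21 08:%02d:00 host-%04d INFO disk usage at %d%%" % (
            i % 60, i, i % 100)
        items.append(text.ljust(size).encode("ascii"))
    return items


def compressed_sizes(items):
    """Return (total bytes when compressed one by one, bytes as one batch)."""
    one_by_one = sum(len(zlib.compress(item)) for item in items)
    as_batch = len(zlib.compress(b"".join(items)))
    return one_by_one, as_batch


if __name__ == "__main__":
    items = make_items()
    one_by_one, as_batch = compressed_sizes(items)
    print("raw bytes:        ", sum(len(i) for i in items))
    print("compressed singly:", one_by_one)
    print("compressed batch: ", as_batch)
```

Each individually compressed item pays its own header overhead and cannot share repeated substrings with its neighbours, which is why the batch comes out smaller.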
-
RE: Kafka versus classic central HTTP(s) services for logs transmission
Sybrandy, Casey 2012-10-22, 12:17
With regards to security, you can always use stunnel to handle the encryption.
-
Re: Kafka versus classic central HTTP(s) services for logs transmission
Neha Narkhede 2012-10-22, 17:49
>> One of the main issue they see with Kafka is that It requires connections from Consolidation Server to Kafka brokers and to Zookeeper daemons located in each “site”, versus connections from logs producers in all sites to the
>> Consolidation servers.

When you say "site", do you mean data center?

If yes, then Kafka would be ideal, since Kafka provides the ability to set up a cluster that can replicate data from several other clusters located in different data centers. Kafka has compression and batching features built in that can make good use of limited cross-DC bandwidth. If you go down this route, you can set up a local Kafka and Zookeeper cluster at each "site". Each "site" will have its producers send data to the local Kafka cluster. The Kafka cluster in the "site" hosting the consolidation servers will replicate data from every other "site". The consolidation servers then act as Kafka consumers, pulling data from the local Kafka cluster and performing aggregate analysis on every site's data.

The advantage of this solution over having the producers talk directly to the consolidation servers is essentially the decoupling between producers and consumers. If the consumers can't keep up with the producers, the decoupling provides a persistent buffer that prevents your queues from overflowing and protects the consumers from being overloaded. Kafka, being horizontally scalable, allows you to scale out if the throughput requirements increase in the future. This might not be easy to do at the consolidation servers.

In addition to this, the advantage of Kafka is that you can consume the same data multiple times as you find more applications in the future wanting to perform different analyses on the log data. One example of this is offline analytics using Hive/Pig. This is a huge win over the other solution, which requires you to store multiple copies of the same data; the storage grows linearly with the number of consumer applications, essentially making it a very expensive solution.

Thanks,
Neha
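
The decoupling and "consume the same data multiple times" points in this message can be sketched with a toy in-memory model: a single append-only log stored once, with each consumer tracking its own read position. This is not Kafka's API or storage format; `ToyLog`, its method names, and the consumer names are hypothetical and exist only to illustrate the idea.

```python
class ToyLog:
    """Toy append-only log with an independent read offset per consumer."""

    def __init__(self):
        self._records = []   # data is stored once, whatever the consumer count
        self._offsets = {}   # consumer name -> next position to read

    def append(self, record):
        """Producer side: add a record and return its position in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: pull the next records at the consumer's own pace."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        """Number of records this consumer has not read yet."""
        return len(self._records) - self._offsets.get(consumer, 0)


if __name__ == "__main__":
    log = ToyLog()
    for i in range(5):
        log.append("line-%d" % i)
    print(log.poll("alerting", max_records=3))   # a slow consumer reads a few
    print(log.poll("analytics"))                 # another reads everything
    print(log.lag("alerting"), log.lag("analytics"))
```

A slow consumer simply accumulates lag instead of pushing back on producers, and adding a second consumer adds an offset entry rather than a second copy of the data.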
-
Re: Kafka versus classic central HTTP(s) services for logs transmission
Jean Bic 2012-10-22, 20:12
Neha:

Thank you very much for your comprehensive answer.

What I call "sites" can be either a data center or a VSA (Virtual Software Appliance) at a customer's site. While your arguments are fine for data centers, I'm afraid we will still have a harder time with producers embedded in a VSA, because we'll have to face our customers' IT teams.

To be more precise, there would be producers in the various VMs in the VSA, plus a pre-installed pair of brokers in the VSA and, say, 3 small Zookeeper VMs.

I'm wondering if stunnel would be a good way to limit criticism from our customers' IT teams.

Is there any project to make an HTTPS "driver" for consumers, which could "enter" through a reverse proxy and thus only require one endpoint per "site"?

Would such an architecture make sense?

Thanks,
Jean