Kafka >> mail # user >> Our scenario and couple of questions


Our scenario and couple of questions
Hi everyone,

*Our current situation (without Kafka)*

- We currently have 8 event tracker servers that together can handle
8000 HTTP events / second, but a normal day's peak throughput is
about 1250 messages / second.
- Messages are basically HTTP events enriched by various Apache mods and
transformations, eventually written into log files.
- Each event is roughly 0.5 KB when packed as JSON.
- These message logs are compressed and shipped to S3 every 5 minutes,
where they are used by Hive and other Hadoop jobs.
- Pretty standard.
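
As a rough back-of-the-envelope check, the figures above can be turned into data rates. This is only a sketch: the constants come from the list, while the function name and the use of decimal megabytes are assumptions made for illustration.

```python
# Figures taken from the list above; everything else is derived from them.
SERVERS = 8
CAPACITY_EVENTS_PER_SEC = 8000   # total across all 8 trackers
PEAK_EVENTS_PER_SEC = 1250       # normal-day peak
EVENT_SIZE_KB = 0.5              # approximate JSON size per event
SHIP_INTERVAL_SEC = 5 * 60       # logs are shipped to S3 every 5 minutes


def throughput_kb_per_sec(events_per_sec, event_size_kb=EVENT_SIZE_KB):
    """Uncompressed data rate in KB/s for a given event rate."""
    return events_per_sec * event_size_kb


peak_kb_s = throughput_kb_per_sec(PEAK_EVENTS_PER_SEC)
capacity_kb_s = throughput_kb_per_sec(CAPACITY_EVENTS_PER_SEC)
per_server_peak_events = PEAK_EVENTS_PER_SEC / SERVERS
# Uncompressed volume accumulated between two S3 shipments (decimal MB).
per_shipment_mb = peak_kb_s * SHIP_INTERVAL_SEC / 1000

print(peak_kb_s, capacity_kb_s, per_server_peak_events, per_shipment_mb)
```

So at peak the whole fleet produces well under 1 MB/s of uncompressed JSON, which gives a sense of the load each local broker would see.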
*My plan is to introduce a Kafka system on top of the existing offline
log processing.*

I have a simulated event stream and have written a Hadoop job similar to
the ETL consumer in trunk, except that I keep the offsets in ZooKeeper
and the output files are partitioned into date directories.
In the first phase I am going to install a Kafka broker on each of the 8
tracker servers and simply run tail | php producer.php on each of them,
with the PHP code publishing to the local broker node under a single
topic. In total there will be a cluster of 8 Kafka servers, with a 3- or
5-node ZooKeeper ensemble co-located on the same hardware. This topic is
going to be mirrored into a central Kafka cluster, where the hadoop-loader
job will run every 30 minutes or so.
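
The date-partitioned output mentioned above could look roughly like the following. This is a minimal sketch, not the actual job: the function name, the base/topic/YYYY/MM/DD layout, and the use of UTC are all assumptions for illustration.

```python
from datetime import datetime, timezone


def output_dir(base, topic, epoch_seconds):
    """Map an event timestamp to a date directory (hypothetical layout)."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/{topic}/{day:%Y/%m/%d}"


# Example: an event stamped at the Unix epoch lands in the 1970/01/01 directory.
print(output_dir("/data", "events", 0))
```

Keying the directory on the event's own timestamp rather than on the job's run time would keep late-arriving events in the right day, at the cost of occasionally writing into older directories.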

*Question 1*: If each broker has one topic with one partition and I
implement a partitioned producer (in PHP), I still have 8 partitions in
total, correct?
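
To make the setup behind Question 1 concrete, here is a small sketch of how a partitioned producer might enumerate its targets, assuming (as the question supposes) that partitions are counted per broker. The broker ids 1 to 8, both function names, and the CRC32-based key hashing are hypothetical choices, not taken from any Kafka client.

```python
import zlib


def broker_partitions(broker_ids, partitions_per_broker):
    """List every (broker_id, partition) pair the producer could write to."""
    return [(b, p) for b in sorted(broker_ids) for p in range(partitions_per_broker)]


def choose_target(key, targets):
    """Pick a target deterministically from a message key (CRC32 modulo)."""
    return targets[zlib.crc32(key.encode("utf-8")) % len(targets)]


# 8 brokers with one partition each give 8 distinct targets.
targets = broker_partitions(range(1, 9), 1)
print(len(targets), choose_target("user-42", targets))
```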
*Question 2*: In the future I may have multiple event tracking clusters
which I want mirrored onto a single topic in the central cluster. Is this
kind of mirroring possible with 0.7.x?
*Question 3*: If I want the low-level PHP producer to batch & zip 10
messages like the async Scala/Java producer does, all I have to do is
send a message that is a message set containing all 10 messages,
correct?
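
The batching idea in Question 3 can be sketched in a language-neutral way as follows. Note the assumption: the 4-byte length-prefix framing here is a simplification chosen for illustration and is not the Kafka message-set wire format, which carries additional per-message header fields. All function names are hypothetical.

```python
import gzip
import struct
from itertools import islice


def batches(messages, size=10):
    """Yield lists of up to `size` messages from any iterable."""
    it = iter(messages)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def pack_batch(chunk):
    """Length-prefix each message, concatenate, then gzip the whole batch."""
    framed = b"".join(struct.pack(">I", len(m)) + m for m in chunk)
    return gzip.compress(framed)


def unpack_batch(blob):
    """Inverse of pack_batch: gunzip, then split on the length prefixes."""
    data = gzip.decompress(blob)
    out, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        out.append(data[pos:pos + n])
        pos += n
    return out


events = [b"event-%d" % i for i in range(25)]
groups = list(batches(events))
print([len(g) for g in groups])
```

Compressing the batch as one unit, rather than each message separately, is what lets similar JSON events share a compression dictionary.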
*Question 4*: This system is quite likely to go into production in the
next few weeks, and I prefer staying with 0.7.x because it's simpler for
non-Java clients. Would you advise me to build on 0.8.x instead, and why?
Thanks a lot
--
Michal Haris
Software Engineer

VisualDNA | 7 Moor Street, London, W1D 5NB
www.visualdna.com | t: +44 (0) 207 734 7033