Kafka >> mail # user >> Our scenario and couple of questions

Our scenario and couple of questions

Hi everyone,

Our current situation (without Kafka):

- at the moment we have 8 event tracker servers that in total are capable
of handling 8000 HTTP events / second, but a normal day's peak throughput is
about 1250 messages / second
- messages are basically HTTP events enriched by various Apache mods and
transformations, eventually written into log files
- each event is approx. 0.5 KB when packed as JSON
- these message logs are compressed and shipped to S3 every 5 minutes,
where they are used by Hive and other Hadoop jobs
- pretty standard
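
For a sense of scale, the figures above translate roughly into the following data rates. This is only a back-of-envelope sketch: it treats the approximate 0.5 KB event size as exact and ignores compression, and all names in it are made up for illustration.

```python
# Back-of-envelope sizing from the figures in the list above.
# Assumption: every event is exactly 0.5 KB (the post only says "approx.").
EVENT_KB = 0.5
PEAK_EVENTS_PER_SEC = 1250       # normal day peak
CAPACITY_EVENTS_PER_SEC = 8000   # total across the 8 trackers
SHIP_INTERVAL_SEC = 5 * 60       # logs are shipped to S3 every 5 minutes


def kb_per_sec(events_per_sec, event_kb=EVENT_KB):
    """Uncompressed data rate in KB/s for a given event rate."""
    return events_per_sec * event_kb


peak_kb_s = kb_per_sec(PEAK_EVENTS_PER_SEC)
capacity_kb_s = kb_per_sec(CAPACITY_EVENTS_PER_SEC)
# Uncompressed volume accumulated between two S3 shipments at peak rate.
peak_kb_per_shipment = peak_kb_s * SHIP_INTERVAL_SEC

print(peak_kb_s, capacity_kb_s, peak_kb_per_shipment)
```

So the peak is well under 1 MB/s of uncompressed JSON across all 8 servers.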

My plan is to introduce a Kafka system on top of the existing offline
log processing.

I have a simulated event stream and have written a Hadoop job similar to
the ETL consumer in trunk, except that I keep the offsets in ZooKeeper
and the output files are partitioned into date directories.
In the first phase I am going to install a Kafka broker on each of the 8
tracker servers and simply run tail | php producer.php on each of them, so
that the PHP code publishes into the local broker node under a single topic.
In total there will be a cluster of 8 Kafka servers with a 3- or 5-node
ZooKeeper ensemble co-located on the same hardware. This topic is
going to be mirrored into a central Kafka cluster where the hadoop-loader
job will run every 30 min or so.
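
To illustrate what "partitioned into date directories" means here, a minimal sketch follows. The base path, the function name and the choice of UTC are assumptions made for the example, not the layout of the actual job.

```python
import posixpath
from datetime import datetime, timezone


def output_dir(base_dir: str, event_ts: float) -> str:
    """Return the date directory for an event.

    base_dir is a hypothetical output root; event_ts is a Unix timestamp
    in seconds. The day boundary is taken in UTC (an assumption).
    """
    day = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return posixpath.join(base_dir, day)


print(output_dir("/data/events", 0))
```

Deriving the directory from the event timestamp rather than from the job's run time keeps late-arriving events in the directory of the day they belong to.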

*Question 1*: If each broker has one topic and one partition, and I want to
implement a partitioned producer (in PHP), I still have 8 partitions in
total, correct?
*Question 2*: In future I may have multiple event tracking clusters which I
want mirrored onto a single topic in the central cluster. Is this kind of
mirroring possible with 0.7.x?
*Question 3*: If I want the low-level PHP producer to batch & zip 10
messages like the async Scala/Java producer does, all I have to do is
send a message that is a message set containing all 10 messages,
correct?
*Question 4*: This system is quite likely to go into production in the next
few weeks, and I prefer staying with 0.7.x because it's simpler for non-Java
clients, but would you advise me to build on 0.8.x, and why?
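
To make Questions 1 and 3 concrete, here is a rough sketch of the two ideas: choosing one of the 8 broker-local partitions from a message key, and grouping messages into compressed batches of 10. It is written in Python rather than PHP for brevity. The length-prefixed framing is purely illustrative and is NOT the Kafka wire format, and every function name here is hypothetical.

```python
import gzip
import struct
import zlib

NUM_BROKERS = 8   # one topic partition per broker, as in Question 1
BATCH_SIZE = 10   # batch size mentioned in Question 3


def pick_broker(key: bytes, num_brokers: int = NUM_BROKERS) -> int:
    """Map a key to a broker index with a stable hash (CRC32 modulo)."""
    return zlib.crc32(key) % num_brokers


def batches(messages, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` messages."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]


def pack_batch(messages) -> bytes:
    """Length-prefix each message, concatenate, then gzip the result.

    Illustrative framing only; a real producer must follow the broker's
    own message-set format.
    """
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    return gzip.compress(framed)


def unpack_batch(blob: bytes):
    """Inverse of pack_batch: gunzip and split on the length prefixes."""
    data = gzip.decompress(blob)
    out, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        out.append(data[pos:pos + n])
        pos += n
    return out


events = [b'{"id": %d}' % i for i in range(25)]
blobs = [pack_batch(b) for b in batches(events)]
print(len(blobs), pick_broker(b"user-42"))
```

With 25 events this produces three batches (10, 10 and 5 messages), each sent as one compressed unit.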
Thanks a lot
Michal Haris
Software Engineer

VisualDNA | 7 Moor Street, London, W1D 5NB
www.visualdna.com | t: +44 (0) 207 734 7033