Kafka user mailing list: Our scenario and couple of questions


Michal Haris 2012-10-16, 17:34
Neha Narkhede 2012-10-16, 17:45
Michal Haris 2012-10-17, 09:39
Re: Our scenario and couple of questions
Oh, one more aspect of the problem: the event stream could potentially be
split into multiple topics, and I have an idea of how and with what
partitioning, but since mirroring doesn't preserve the partitioning, nor
does it support a custom partitioner implementation, I have a dilemma.
Note that there will be other topics besides this event stream in the
entire system, but for now only this one is relevant:

   - Is it fine to have a single topic and have consumers wade through
   many irrelevant messages only to find the few they are interested in?
   - Or would it make more sense to have one topic for the sake of
   mirroring, and then a consumer/producer pair that republishes those
   messages into multiple sub-topics, where messages would appear
   redundantly in several topics, each with a different partitioner?
   (A sketch of such a custom partitioner follows this list.)
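
If it helps frame the second option: the republishing consumer would feed
one producer per sub-topic, each configured with its own partitioner. A
minimal sketch against what I believe is the 0.7 producer API
(kafka.producer.Partitioner); the class name and the user-id key are
illustrative, not from the thread:

    import kafka.producer.Partitioner;

    // Sketch of a custom partitioner for one republished sub-topic.
    // 0.7 hands the partitioner the message key and the partition count.
    public class UserIdPartitioner implements Partitioner<String> {
        public int partition(String userId, int numPartitions) {
            // Mask the sign bit so the modulo stays non-negative.
            return (userId.hashCode() & 0x7fffffff) % numPartitions;
        }
    }

If memory of the 0.7 config serves, this is wired in through the
producer's "partitioner.class" property.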

Thanks for your help,

Michal
On 17 October 2012 10:39, Michal Haris <[EMAIL PROTECTED]> wrote:

> Great, thanks a lot!
>
>
> On 16 October 2012 18:45, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>
>> >> *Question 1*: If each broker has one topic and one partition, and I
>> >> want to implement a partitioned producer (in php), I still have 8
>> >> partitions in total, correct?
>>
>> Correct
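
(In other words: 8 brokers x 1 partition each = 8 cluster-wide partitions,
ids 0..7. A minimal illustration of hash-mod routing; the helper below is
illustrative arithmetic, not the Kafka API:)

    // Illustrative only: map a key onto one of the cluster-wide slots,
    // then recover which broker hosts it.
    static int[] route(String key, int brokers, int partitionsPerBroker) {
        int total = brokers * partitionsPerBroker;        // 8 * 1 = 8
        int slot = (key.hashCode() & 0x7fffffff) % total; // 0..7
        return new int[] { slot / partitionsPerBroker,    // broker index
                           slot % partitionsPerBroker };  // partition on it
    }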
>>
>> >> *Question 2*: In future I may have multiple event tracking clusters
>> >> which I want mirrored onto a single topic in the central tracker; is
>> >> this kind of mirroring possible with 0.7.x?
>>
>> This is available from 0.7.1 onwards
>>
>> >> *Question 3*: If I want the low-level php producer to batch & zip 10
>> >> messages like the async scala/java producer does, all I have to do is
>> >> send one message that is itself a message set containing all 10
>> >> messages, correct?
>>
>> Yes, provided you conform with the format of a compressed message -
>> https://cwiki.apache.org/confluence/display/KAFKA/Compression
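
(As I read that wiki page, a 0.7 compressed batch is one outer message
whose payload is the gzipped serialization of an inner message set. A
minimal sketch of the byte layout, assuming the 0.7 format of
magic(1) | attributes(1) | crc32-of-payload(4) | payload, with message
sets as repeated [size(4) | message]; treat the details as assumptions
and verify against the wiki:)

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;
    import java.util.zip.GZIPOutputStream;

    public class CompressedBatchSketch {
        // One message: magic=1 means the attributes byte is present;
        // the low 2 bits of attributes pick the codec (0=none, 1=gzip).
        static byte[] message(byte[] payload, byte attributes) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            ByteBuffer buf = ByteBuffer.allocate(6 + payload.length);
            buf.put((byte) 1);                // magic
            buf.put(attributes);              // compression codec bits
            buf.putInt((int) crc.getValue()); // CRC32 of the payload only
            buf.put(payload);
            return buf.array();
        }

        // A message set is just [4-byte length][message] repeated.
        static byte[] messageSet(byte[][] payloads) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (byte[] p : payloads) {
                byte[] m = message(p, (byte) 0);
                out.write(ByteBuffer.allocate(4).putInt(m.length).array());
                out.write(m);
            }
            return out.toByteArray();
        }

        // The batch itself is a single gzip-flagged message whose
        // payload is the gzipped inner message set.
        static byte[] gzipBatch(byte[][] payloads) throws Exception {
            ByteArrayOutputStream gz = new ByteArrayOutputStream();
            GZIPOutputStream s = new GZIPOutputStream(gz);
            s.write(messageSet(payloads));
            s.close();
            return message(gz.toByteArray(), (byte) 1);
        }
    }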
>>
>> >> *Question 4*: This system is quite likely to go into production in the
>> >> next few weeks, and I prefer staying with 0.7.x because it's simpler
>> >> for non-java clients, but would you advise me to build on 0.8.x, and
>> >> why?
>>
>> I'd recommend staying on 0.7.x since it is stable. If your requirements
>> include message replication, durability and guaranteed delivery, you
>> might want to wait until 0.8 is released. The wire protocol has changed
>> considerably in 0.8.
>>
>> Thanks,
>> Neha
>>
>> On Tue, Oct 16, 2012 at 10:34 AM, Michal Haris
>> <[EMAIL PROTECTED]> wrote:
>> > Hi everyone,
>> >
>> > *Our current situation (without kafka)*
>> >
>> > - we have at the moment 8 event tracker servers that in total are
>> > capable of handling 8000 http events / second, though a normal day's
>> > peak throughput is about 1250 messages / second
>> > - messages are basically http events enriched by various apache mods
>> > and transformations, eventually written into log files
>> > - each event is about 0.5 KB when packed as json
>> > - these message logs are compressed and every 5 minutes shipped into
>> > S3, where they are used by hive and other hadoop jobs
>> > - pretty standard
>> >
>> > *My plan is to introduce a kafka system on top of the existing offline
>> > log-processing.*
>> >
>> > I have a simulated event stream and have written a hadoop job similar
>> > to the etl consumer in trunk, except I keep the offsets in ZooKeeper
>> > and the output files are partitioned into date directories.
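
(For the offsets-in-ZooKeeper part, a minimal sketch; the znode layout
below mirrors what I believe is the 0.7 high-level consumer convention,
/consumers/<group>/offsets/<topic>/<brokerId>-<partition>, but treat the
path, group name, and helper class as assumptions:)

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EtlOffsetStore {
        private final ZooKeeper zk;

        public EtlOffsetStore(ZooKeeper zk) { this.zk = zk; }

        // Persist the last-consumed offset for one broker-partition,
        // e.g. brokerPartition = "3-0" (broker 3, partition 0).
        // Parent znodes are assumed to exist already.
        public void save(String group, String topic, String brokerPartition,
                         long offset) throws Exception {
            String path = "/consumers/" + group + "/offsets/" + topic
                        + "/" + brokerPartition;
            byte[] data = Long.toString(offset).getBytes("UTF-8");
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1); // -1 = ignore znode version
            }
        }
    }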
>> > In the first phase I am going to install a kafka broker on each of the
>> > 8 tracker servers, run tail | php producer.php on each of them, and
>> > have the php code publish into the local broker under a single topic.
>> > In total there will be a cluster of 8 kafka servers with a 3- or 5-node
>> > zookeeper ensemble interlaced on the same hardware. This topic is going
>> > to be mirrored into a central kafka cluster where the hadoop-loader job
>> > will run every 30 min or so.
>> >
>> > *Question 1*: If each broker has one topic and one partition, and I
>> > want to implement a partitioned producer (in php), I still have 8
>> > partitions in total, correct?
Michal Haris
Software Engineer

VisualDNA | 7 Moor Street, London, W1D 5NB
www.visualdna.com | t: +44 (0) 207 734 7033