Kafka >> mail # user >> Our scenario and couple of questions


Michal Haris 2012-10-16, 17:34
Neha Narkhede 2012-10-16, 17:45
Michal Haris 2012-10-17, 09:39
Re: Our scenario and couple of questions
Oh, one more aspect of the problem: the event stream could potentially be
split into multiple topics, and I have an idea of how and with what
partitioning, but since mirroring neither preserves the partitioning nor
supports a custom partitioner implementation, I have a dilemma. Note that
there will be other topics besides this event stream in the overall system,
but for now only this one is relevant:

   - Is it fine to have a single topic and then have consumers process
   pointlessly many messages only to find the few they are interested in?
   - Or would it make more sense to have one topic for the sake of
   mirroring, and then have a consumer/producer pair that republishes those
   messages into multiple sub-topics, where messages would appear
   redundantly in several topics, each with a different partitioner?
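The second option amounts to a small consumer/producer bridge that fans the single mirrored topic out into sub-topics. A minimal sketch of the routing side, assuming hypothetical sub-topic names and event fields (none of these are from the actual setup), might look like:

```python
# Sketch of a fan-out bridge: events from the single mirrored topic are
# routed to one or more sub-topics. Topic names, routing predicates, and
# event fields below are hypothetical illustrations.

import json

# Each sub-topic keeps the events matching its predicate, so one event may
# be republished redundantly to several sub-topics.
ROUTES = {
    "events.clicks":      lambda e: e.get("type") == "click",
    "events.impressions": lambda e: e.get("type") == "impression",
    "events.all-mobile":  lambda e: e.get("device") == "mobile",
}

def route(raw_message):
    """Return the list of sub-topics a raw JSON event should go to."""
    event = json.loads(raw_message)
    return [topic for topic, match in ROUTES.items() if match(event)]
```

In a real bridge, a consumer loop would read the mirrored topic and a producer would publish each event to every topic `route()` returns, applying a different partitioner per sub-topic.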

thanks for your help,

Michal

On 17 October 2012 10:39, Michal Haris <[EMAIL PROTECTED]> wrote:

> Great, thanks a lot!
>
>
> On 16 October 2012 18:45, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>
>> >> *Question 1*: If each broker has one topic and one partition, and I
>> want to implement a partitioned producer (in PHP), I still have 8
>> partitions in total, correct?
>>
>> Correct
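With 8 brokers holding one partition each, a client-side partitioner just has to map a message key deterministically onto 0..7. A minimal sketch (the md5-based hashing scheme is an assumption for illustration, not part of the 0.7 protocol; any stable hash works as long as every producer agrees on it):

```python
# Client-side partitioning across 8 brokers with one partition each.
# The hash choice (md5) is illustrative; the only requirement is that
# all producers map the same key to the same partition.

import hashlib

NUM_PARTITIONS = 8  # 8 brokers x 1 partition per broker

def choose_partition(key: str) -> int:
    """Map a message key onto one of the 8 global partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```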
>>
>> >> *Question 2*: In the future I may have multiple event tracking
>> clusters which I want mirrored onto a single topic in the central
>> tracker; is this kind of mirroring possible with 0.7.x?
>>
>> This is available from 0.7.1 onwards
>>
>> >> *Question 3*: If I want the low-level php producer to batch & zip 10
>> messages like the async scala/java producer does, all I have to do is
>> send a message that is a message set containing all 10 messages,
>> correct?
>>
>> Yes, provided you conform with the format of a compressed message -
>> https://cwiki.apache.org/confluence/display/KAFKA/Compression
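The batching idea can be sketched as follows. Note this is NOT the exact Kafka 0.7 wire format — the linked wiki page specifies the real framing (per-message headers, CRCs, and a compression attribute) — it only illustrates wrapping a whole message set inside one compressed payload:

```python
# Conceptual sketch of batching 10 messages into one compressed payload,
# as the async Java/Scala producer does. This is an illustration only and
# deliberately omits the real Kafka 0.7 message headers, CRCs, and
# compression attribute described on the wiki page.

import gzip
import json

def pack_batch(messages):
    """Serialize a list of event dicts as newline-delimited JSON, then gzip."""
    payload = "\n".join(json.dumps(m) for m in messages).encode("utf-8")
    return gzip.compress(payload)

def unpack_batch(blob):
    """Inverse: decompress and re-parse the batch."""
    lines = gzip.decompress(blob).decode("utf-8").split("\n")
    return [json.loads(line) for line in lines]
```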
>>
>> >> *Question 4*: This system is quite likely to go into production in
>> the next few weeks, and I prefer staying with 0.7.x because it is
>> simpler for non-java clients, but would you advise me to build on
>> 0.8.x, and why?
>>
>> Recommend staying on 0.7.x since it is stable. If your requirements
>> include message replication, durability and guaranteed delivery,
>> you might want to wait until 0.8 is released. The wire protocol has
>> changed considerably in 0.8.
>>
>> Thanks,
>> Neha
>>
>> On Tue, Oct 16, 2012 at 10:34 AM, Michal Haris
>> <[EMAIL PROTECTED]> wrote:
>> > Hi everyone,
>> >
>> > *Our current situation (without kafka)*
>> >
>> > - we have at the moment 8 event tracker servers that in total are
>> > capable of handling 8000 http events / second, but a normal day's
>> > peak throughput is about 1250 messages / second.
>> > - messages are basically http events enriched by various apache mods
>> > and transformations, eventually written into log files
>> > - each event is circa 0.5 KB when packed as json
>> > - these message logs are compressed and every 5 minutes shipped into
>> > S3, where they are used by hive and other hadoop jobs
>> > - pretty standard
>> >
>> > *My plan is to introduce a kafka system on top of the existing
>> > offline log-processing.*
>> >
>> > I have a simulated event stream and have written a hadoop job similar
>> > to the etl consumer in the trunk, except I keep the offsets in
>> > zookeeper and the output files are partitioned by date directory.
>> > In the first phase I am going to install a kafka broker on each of
>> > the 8 tracker servers, and simply run tail | php producer.php on each
>> > of them, with the PHP code publishing into the local broker node
>> > under a single topic. So in total there will be a cluster of 8 kafka
>> > servers with a 3- or 5-node zookeeper ensemble interlaced on the same
>> > hardware. This topic is going to be mirrored into a central kafka
>> > cluster where the hadoop-loader job will run every 30 min or so.
>> >
>> > *Question 1*: If each broker has one topic and one partition, and I
>> > want to implement a partitioned producer (in PHP), I still have 8
>> > partitions in total, correct?

Michal Haris
Software Engineer

VisualDNA | 7 Moor Street, London, W1D 5NB
www.visualdna.com | t: +44 (0) 207 734 7033