Zookeeper >> mail # user >> Zookeeper in Kafka has gets a very low Performance
Re: Zookeeper in Kafka has gets a very low Performance
Hi Liwei,

Having more write operations on ZK Cluster A could explain the
slowness. ZooKeeper batches writes to its transaction log (WAL) to
keep overall throughput high, sometimes at the expense of latency.

To rule out fsync stalls, can you run this strace command on both
Cluster A and Cluster B?
 sudo strace -r -T -f -p <pid> -e trace=fsync,fdatasync -o trace.txt
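
Once trace.txt has been collected, the durations that -T appends in angle brackets can be summarized with a short script. What follows is only a minimal sketch: the function name summarize_syncs and the 0.1 second "slow" threshold are assumptions chosen for illustration, not ZooKeeper defaults, and the parser assumes the usual strace line layout in which the elapsed time is the last field on the line.

```python
import re
import sys
from collections import defaultdict

# Matches a completed or resumed fsync/fdatasync line and captures the
# syscall name plus the elapsed time that `strace -T` prints as <seconds>.
# Lines ending in "<unfinished ...>" carry no duration and are skipped.
SYNC_RE = re.compile(r"(?:<\.\.\. )?\b(fsync|fdatasync)\b.*<(\d+\.\d+)>\s*$")


def summarize_syncs(lines, slow_threshold=0.1):
    """Return per-syscall count, max, mean and number of slow calls.

    slow_threshold is in seconds; 0.1 is an arbitrary illustrative value.
    """
    durations = defaultdict(list)
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            durations[match.group(1)].append(float(match.group(2)))

    summary = {}
    for name, values in durations.items():
        summary[name] = {
            "count": len(values),
            "max": max(values),
            "mean": sum(values) / len(values),
            "slow": sum(1 for v in values if v >= slow_threshold),
        }
    return summary


if __name__ == "__main__":
    # Usage: python summarize_syncs.py trace.txt
    with open(sys.argv[1]) as trace:
        for name, stats in summarize_syncs(trace).items():
            print(name, stats)
```

Running it against the trace from each cluster and comparing the "max" and "slow" figures gives a quick view of whether Cluster A stalls on disk syncs more often than Cluster B.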

Bad ECC memory has been known to cause a single server to run an order
of magnitude slower than the rest - please see the "Hardware - ECC
memory problems can be hard to track down" section of
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting (good to
review for other insights as well).

Regards, Kathleen

On Thu, Sep 13, 2012 at 7:08 AM, Alexander Shraer <[EMAIL PROTECTED]> wrote:
> Sorry, I meant zk-1355 not 1411
> Sent from mobile
> On Sep 13, 2012, at 6:44 AM, Alexander Shraer <[EMAIL PROTECTED]> wrote:
>> You can try connecting readers and writers to different servers. For example, add an observer and connect your readers to it. You'll then need to detect the observer's failure externally and reconnect the readers to the other 3 servers (you can't currently do that without changing the session). Or run 5 followers rather than 3 and tell readers about s1 and s2 while writers know about s3, s4 and s5, or the other way around.
>> If you decide to change connection strings dynamically upon failure, you may want to consider zk-1411, which allows you to do that in the same session.
>> Alex
>> On Sep 12, 2012, at 11:47 PM, sun liwei <[EMAIL PROTECTED]> wrote:
>>> That may be the root cause.
>>> Here is something new I got:
>>> Three zk servers are named: s1, s2, s3.
>>> I restarted zk on s3. After that the performance of this server got better,
>>> and I found that the number of connections / sessions on this server is
>>> small (no more than 10). At the same time the other two servers (s1 and s2) both
>>> have many (80+) connections / sessions and perform badly.
>>> I guess most of the connections and sessions are doing write
>>> operations.
>>> There is a patch, ZOOKEEPER-1505, but it's not included in the 3.3.*
>>> branch. Is there any workaround for this issue?
>>> Liwei
>>> On Thu, Sep 13, 2012 at 1:07 PM, Alexander Shraer <[EMAIL PROTECTED]> wrote:
>>>> Maybe it has something to do with the frequent writes delaying reads ?
>>>> If a write is submitted to a follower and then a read (getData),
>>>> no matter if the write and the read are to the same data, the read
>>>> will block until the write completes. That's what ZOOKEEPER-1505 tries
>>>> to address.
>>>> Alex
>>>> On Wed, Sep 12, 2012 at 6:44 PM, sun liwei <[EMAIL PROTECTED]> wrote:
>>>>> Let me give more details:
>>>>> I have two zookeeper clusters: A(3.3.4) and B(3.4.3).
>>>>> A is used by Kafka only. B serves other applications, not Kafka.
>>>>> The performance of A is very bad no matter which path I read with
>>>>> getData, but B works well.
>>>>> My suspicion is that Kafka is slowing A down. There are 1000+
>>>>> consumers (each consumer is a node) in A. Kafka updates the value of all
>>>>> these consumers very frequently. So, that means there are far more write
>>>>> operations in A than read operations.
>>>>> On Wed, Sep 12, 2012 at 6:33 PM, sun liwei <[EMAIL PROTECTED]> wrote:
>>>>>> The version of zookeeper is 3.3.4.
>>>>>> On Wed, Sep 12, 2012 at 6:07 PM, sun liwei <[EMAIL PROTECTED]> wrote:
>>>>>>> Here is what I have:
>>>>>>>  - zookeeper cluster with 3 servers
>>>>>>>  - 1700+ kafka topics (that means there are 1700+ children under path
>>>>>>>  '/brokers/topics'), these topics are written frequently by kafka.
>>>>>>>  - log file size is 67108880 bytes, snapshot file size is 4192072
>>>>>>>  bytes, new log/snapshot files are created every one or two minutes
>>>>>>> The problem is that it takes more than 40ms to getData of a single