Zookeeper >> mail # user >> Zookeeper delay  to reconnect


Re: Zookeeper delay to reconnect
Hi Brian, well, in my proposal the default would be the current
behavior, with the zk operator free to change it, so it shouldn't be
any worse.

You've piqued my interest - a single client attempting to connect is
responsible for bringing down the entire cluster? Could you provide
more details?

Patrick

On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <[EMAIL PROTECTED]> wrote:
> I would lobby not to change this...I'm still occasionally dealing with
> clients spinning trying to connect...which brings down the whole cluster
> until that one client is killed.
>
> Brian
>
> On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
>
>> The random sleep was explicitly added to reduce herd effects and
>> general "spinning client" problems iirc. Keep in mind that ZK
>> generally trades off performance for availability. It wouldn't be a
>> good idea to remove it in general. If anything we should have a more
>> aggressive backoff policy in the case where clients are just spinning.
>>
>> Perhaps a pluggable approach here? Where the default is something like
>> what we already have, but allow users to implement their own policy if
>> they like. We could have a few implementations "out of the box"; 1)
>> current, 2) no wait, 3) exponential backoff after trying each server
>> in the ensemble, etc... This would also allow for experimentation.
>>
>> Patrick
>>
>> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[EMAIL PROTECTED]>
>> wrote:
>> > Hi Sergei,
>> >
>> > Your suggestion sounds reasonable to me. I think the sleep was added
>> > so that the client doesn't spin when the entire zookeeper is down. The
>> > client could try to connect to each server without sleep, and sleep
>> > for 1 second only after failing to connect to all the servers in the
>> > cluster.
>> >
>> > Thanks!
>> > --Michi
>> >
>> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hi,
>> >> Zookeeper implements a delay of up to 1 second before trying to reconnect.
>> >>
>> >> ClientCnxn$SendThread
>> >>         @Override
>> >>         public void run() {
>> >>             ...
>> >>             while (state.isAlive()) {
>> >>                 try {
>> >>                     if (!clientCnxnSocket.isConnected()) {
>> >>                         if(!isFirstConnect){
>> >>                             try {
>> >>                                 Thread.sleep(r.nextInt(1000));
>> >>                             } catch (InterruptedException e) {
>> >>                                 LOG.warn("Unexpected exception", e);
>> >>                             }
>> >>
>> >> This creates "outages" of up to 1s (even with a simple retry on
>> >> ConnectionLoss), even with a perfectly healthy cluster, as in a
>> >> rolling-restart scenario. In our scenario this might be a problem under
>> >> high load, creating a spike in the number of requests waiting on zk
>> >> operations.
>> >> Would it be a better strategy to attempt the reconnect immediately at
>> >> least once? Or is there more to it?
>>
>
>
>
> --
> http://about.me/BrianTarbox
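
For readers following the thread, here is a minimal sketch of what the pluggable reconnect policy discussed above might look like. It is an illustration only, not ZooKeeper code. The interface name ReconnectPolicy, the three class names, and the delayMillis(attempt, ensembleSize) signature are all hypothetical, as are the base and cap parameters of the backoff variant. The three implementations correspond to the options listed in the thread: the current random sleep, no wait, and a backoff that only kicks in after every server in the ensemble has been tried once.

```java
import java.util.Random;

/**
 * Hypothetical plug-in point; NOT part of the ZooKeeper client API.
 * Decides how long the client sleeps before its next connection attempt.
 */
interface ReconnectPolicy {
    /**
     * @param attempt      consecutive failed connection attempts so far (1-based)
     * @param ensembleSize number of servers the client can try
     * @return milliseconds to sleep before the next attempt
     */
    long delayMillis(int attempt, int ensembleSize);
}

/** Option 1: mirrors the quoted behavior, a random sleep of 0 to 999 ms. */
final class RandomSleepPolicy implements ReconnectPolicy {
    private final Random random = new Random();

    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        return random.nextInt(1000);
    }
}

/** Option 2: retry immediately, every time. */
final class NoWaitPolicy implements ReconnectPolicy {
    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        return 0L;
    }
}

/**
 * Option 3: no sleep while servers in the current pass remain untried;
 * after each full pass over the ensemble, sleep with exponential backoff.
 */
final class BackoffAfterFullPassPolicy implements ReconnectPolicy {
    private final long baseMillis; // assumed delay after the first full pass
    private final long maxMillis;  // assumed upper bound on any delay

    BackoffAfterFullPassPolicy(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        // No delay until every server has failed once in this pass.
        if (attempt <= 0 || ensembleSize <= 0 || attempt % ensembleSize != 0) {
            return 0L;
        }
        int completedPasses = attempt / ensembleSize;
        // Cap the shift so the doubling stays well inside the range of long
        // for delays on the order of seconds.
        int shift = Math.min(completedPasses - 1, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }
}
```

With a three-server ensemble and new BackoffAfterFullPassPolicy(1000, 30000), the first two failures are retried immediately, the third is followed by a 1 second sleep, the sixth by 2 seconds, and so on up to the 30 second cap. A rolling restart would then usually reconnect without any sleep, while a client facing a fully down ensemble would still back off rather than spin.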