Zookeeper, mail # user - Zookeeper delay to reconnect


Sergei Babovich 2012-09-27, 20:34
Michi Mutsuzaki 2012-09-27, 21:28
Ben Bangert 2012-09-28, 16:34
Patrick Hunt 2012-09-27, 23:55
Sergei Babovich 2012-09-28, 15:15
Brian Tarbox 2012-09-27, 23:58

Re: Zookeeper delay to reconnect
Patrick Hunt 2012-09-28, 00:07
Hi Brian, well, in my proposal the default would be the current
behavior, and the zk operator could change it at their discretion, so
it shouldn't be any worse.

You've piqued my interest - a single client attempting to connect is
responsible for bringing down the entire cluster? Could you provide
more details?

Patrick

On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <[EMAIL PROTECTED]> wrote:
> I would lobby not to change this...I'm still occasionally dealing with
> clients spinning trying to connect...which brings down the whole cluster
> until that one client is killed.
>
> Brian
>
> On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
>
>> The random sleep was explicitly added to reduce herd effects and
>> general "spinning client" problems iirc. Keep in mind that ZK
>> generally trades off performance for availability. It wouldn't be a
>> good idea to remove it in general. If anything we should have a more
>> aggressive backoff policy in the case where clients are just spinning.
>>
>> Perhaps a pluggable approach here? Where the default is something like
>> what we already have, but allow users to implement their own policy if
>> they like. We could have a few implementations "out of the box"; 1)
>> current, 2) no wait, 3) exponential backoff after trying each server
>> in the ensemble, etc... This would also allow for experimentation.
>>
>> Patrick
>>
>> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
>> > Hi Sergei,
>> >
>> > Your suggestion sounds reasonable to me. I think the sleep was added
>> > so that the client doesn't spin when the entire zookeeper is down. The
>> > client could try to connect to each server without sleep, and sleep
>> > for 1 second only after failing to connect to all the servers in the
>> > cluster.
>> >
>> > Thanks!
>> > --Michi
>> >
>> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hi,
>> >> Zookeeper implements a delay of up to 1 second before trying to reconnect.
>> >>
>> >> ClientCnxn$SendThread
>> >>         @Override
>> >>         public void run() {
>> >>             ...
>> >>             while (state.isAlive()) {
>> >>                 try {
>> >>                     if (!clientCnxnSocket.isConnected()) {
>> >>                         if(!isFirstConnect){
>> >>                             try {
>> >>                                 Thread.sleep(r.nextInt(1000));
>> >>                             } catch (InterruptedException e) {
>> >>                                 LOG.warn("Unexpected exception", e);
>> >>                             }
>> >>
>> >> This creates "outages" (even with a simple retry on ConnectionLoss) of up
>> >> to 1s even with a perfectly healthy cluster, as in the scenario of a
>> >> rolling restart. In our scenario it might be a problem under high load,
>> >> creating a spike in the number of requests waiting on a zk operation.
>> >> Would it be a better strategy to perform a reconnect attempt immediately
>> >> at least one time? Or is there more to it?
>>
>
>
>
> --
> http://about.me/BrianTarbox
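
The pluggable reconnect policy floated in the quoted messages could look roughly like the Java sketch below. It is an illustration only: `ReconnectPolicy`, `RandomDelayPolicy`, `NoWaitPolicy`, `EnsembleBackoffPolicy`, and the `delayMillis` signature are hypothetical names invented here, not part of the ZooKeeper client API. The sketch assumes the client can count failed connection attempts since its last successful connection and knows how many servers are in its connect string. The three classes correspond to the "out of the box" options listed in the thread: the current random sleep, no wait, and exponential backoff after trying each server in the ensemble.

```java
import java.util.Random;

/** Hypothetical policy interface; NOT part of the ZooKeeper API. */
interface ReconnectPolicy {
    /**
     * @param failedAttempts failed connection attempts since the last
     *                       successful connection (1 for the first failure)
     * @param ensembleSize   number of servers the client can try
     * @return milliseconds to sleep before the next connection attempt
     */
    long delayMillis(int failedAttempts, int ensembleSize);
}

/** Option 1: mirrors the quoted Thread.sleep(r.nextInt(1000)) behavior. */
class RandomDelayPolicy implements ReconnectPolicy {
    private final Random r = new Random();

    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        return r.nextInt(1000); // uniform in [0, 1000) ms
    }
}

/** Option 2: reconnect immediately, every time. */
class NoWaitPolicy implements ReconnectPolicy {
    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        return 0;
    }
}

/**
 * Option 3: no sleep while walking the ensemble; after each full failed
 * pass, sleep for baseMillis, doubling per pass up to maxMillis.
 */
class EnsembleBackoffPolicy implements ReconnectPolicy {
    private final long baseMillis;
    private final long maxMillis;

    EnsembleBackoffPolicy(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        // Only sleep when a whole pass over the ensemble has just failed.
        if (failedAttempts <= 0 || ensembleSize <= 0
                || failedAttempts % ensembleSize != 0) {
            return 0;
        }
        int passes = failedAttempts / ensembleSize; // 1, 2, 3, ...
        long delay = baseMillis;
        // Double once per additional failed pass, stopping at the cap.
        for (int i = 1; i < passes && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With equal base and cap, for example `new EnsembleBackoffPolicy(1000, 1000)`, this reduces to the other suggestion in the thread: try every server without sleeping, then sleep 1 second only after all of them have failed. Keeping `RandomDelayPolicy` as the default would preserve today's behavior for operators who do not opt in.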
Brian Tarbox 2012-09-28, 00:21