Zookeeper, mail # user - Zookeeper delay  to reconnect


Re: Zookeeper delay to reconnect
Sergei Babovich 2012-09-28, 15:15
Thanks, Patrick!
On 09/27/2012 07:55 PM, Patrick Hunt wrote:
> The random sleep was explicitly added to reduce herd effects and
> general "spinning client" problems iirc. Keep in mind that ZK
> generally trades off performance for availability.
That's exactly my concern. It is not about performance: from the
client's point of view, the reconnect delay makes the cluster effectively
unavailable for up to a second. In scenarios with a relatively
low number of sessions (so herding is not a concern), each
processing a lot of requests, such a strategy can cause
instability: there is no way to gracefully handle intermittent errors
caused by normal operational procedures without risking the client's stability.
> It wouldn't be a
> good idea to remove it in general. If anything we should have a more
> aggressive backoff policy in the case where clients are just spinning.
>
> Perhaps a pluggable approach here? Where the default is something like
> what we already have, but allow users to implement their own policy if
> they like. We could have a few implementations "out of the box": 1)
> current, 2) no wait, 3) exponential backoff after trying each server
> in the ensemble, etc... This would also allow for experimentation.
Totally agree. A customizable strategy would be the answer for accommodating
different requirements.
Just curious: does the randomized delay make a real difference here? Was it
a real issue somebody hit? I'd expect that randomizing the server address to
reconnect to would be enough: the load would be evenly distributed across
the remaining cluster nodes and should not create a problem, assuming
the zookeeper cluster has enough capacity.
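
To make the pluggable idea concrete, here is a minimal sketch of what such a policy could look like, covering the three variants Patrick lists. Everything in it is an assumption for illustration: ReconnectPolicy, RandomDelayPolicy, NoWaitPolicy, BackoffAfterFullPassPolicy and the delayMillis(attempt, ensembleSize) signature are hypothetical names and are not part of the ZooKeeper client API. The third policy follows Michi's suggestion of trying every server without sleeping and only backing off once a full pass over the ensemble has failed.

```java
import java.util.Random;

/** Hypothetical interface for illustration; not part of the ZooKeeper client API. */
interface ReconnectPolicy {
    /**
     * @param attempt      consecutive failed connection attempts so far (0 = first reconnect)
     * @param ensembleSize number of servers the client can try
     * @return milliseconds to sleep before the next connection attempt
     */
    long delayMillis(int attempt, int ensembleSize);
}

/** 1) Behaviour like the snippet quoted in this thread: random sleep below 1 second. */
final class RandomDelayPolicy implements ReconnectPolicy {
    private final Random random = new Random();

    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        return random.nextInt(1000); // uniformly distributed in [0, 999]
    }
}

/** 2) No wait: always reconnect immediately. */
final class NoWaitPolicy implements ReconnectPolicy {
    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        return 0;
    }
}

/** 3) No wait within a pass over the ensemble, exponential backoff between passes. */
final class BackoffAfterFullPassPolicy implements ReconnectPolicy {
    private final long baseMillis;
    private final long maxMillis;

    BackoffAfterFullPassPolicy(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int attempt, int ensembleSize) {
        if (ensembleSize <= 0) {
            throw new IllegalArgumentException("ensembleSize must be positive");
        }
        int completedPasses = attempt / ensembleSize;
        boolean startOfPass = attempt % ensembleSize == 0;
        if (completedPasses == 0 || !startOfPass) {
            return 0; // still servers left to try in the current pass
        }
        // Double the delay after every failed pass; cap the shift and the result.
        int shift = Math.min(completedPasses - 1, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }
}
```

With a three-server ensemble and new BackoffAfterFullPassPolicy(100, 1000), the first three attempts happen back to back, the fourth waits 100 ms, the seventh waits 200 ms, and so on up to the 1 second cap. A rolling restart would then cost no added delay as long as one of the other servers accepts the connection.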
>
> Patrick
>
> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[EMAIL PROTECTED]> wrote:
>> Hi Sergei,
>>
>> Your suggestion sounds reasonable to me. I think the sleep was added
>> so that the client doesn't spin when the entire zookeeper is down. The
>> client could try to connect to each server without sleep, and sleep
>> for 1 second only after failing to connect to all the servers in the
>> cluster.
>>
>> Thanks!
>> --Michi
>>
>> On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
>> <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>> Zookeeper implements a delay of up to 1 second before trying to reconnect.
>>>
>>> ClientCnxn$SendThread
>>>          @Override
>>>          public void run() {
>>>              ...
>>>              while (state.isAlive()) {
>>>                  try {
>>>                      if (!clientCnxnSocket.isConnected()) {
>>>                          if(!isFirstConnect){
>>>                              try {
>>>                                  Thread.sleep(r.nextInt(1000));
>>>                              } catch (InterruptedException e) {
>>>                                  LOG.warn("Unexpected exception", e);
>>>                              }
>>>
>>> This creates "outages" of up to 1s (even with a simple retry on ConnectionLoss),
>>> even with a perfectly healthy cluster, as in a rolling restart. In
>>> our scenario it might be a problem under high load, creating a spike in the
>>> number of requests waiting on a zk operation.
>>> Would it be a better strategy to attempt the reconnect immediately at
>>> least once? Or is there more to it?
