Zookeeper >> mail # user >> adding a separate thread to detect network timeouts faster


Re: adding a separate thread to detect network timeouts faster
It's not ms, it's seconds.  The difference between 6 seconds and 30
seconds is very noticeable to any client using a ZK system.

I'd be very interested to hear about existing ways we can use ZK to
achieve faster network failure detection.

Jeremy

On 09/10/2013 01:45 PM, [EMAIL PROTECTED] wrote:
> 5x seems like a lot but what is the functional difference between 5 and 25 ms?
>
> I think you could probably solve the underlying problem a different way, using the guarantees that ZK already makes.
>
> -m
>
> On Sep 10, 2013, at 3:34 PM, Jeremy Stribling <[EMAIL PROTECTED]> wrote:
>
>> I mostly agree, but let's assume that a ~5x speedup in detecting those types of failures is considered significant for some people. Are there technical reasons that would prevent this idea from working?
>>
>> On 09/10/2013 01:31 PM, Ted Dunning wrote:
>>> I don't see the strong value here.  A few failures would be detected more
>>> quickly, but I am not convinced that this would actually improve
>>> functionality significantly.
>>>
>>>
>>> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>>>> despite all of the known drawbacks.  Assume we could deploy it such that
>>>> the ZK servers were all using independent CPUs and storage (though not
>>>> dedicated disks).  Obviously, the shared disks (shared with other, non-ZK
>>>> VMs on the same hypervisor) will cause ZK to hit the default session
>>>> timeout occasionally, so you would need to raise the existing session
>>>> timeout to something like 30 seconds.
>>>>
>>>> I'm curious if there would be any technical drawbacks to adding an
>>>> additional heartbeat mechanism between the clients and the servers, which
>>>> would have the goal of detecting network-only failures faster than the
>>>> existing heartbeat mechanism.  The idea is that there would be a new thread
>>>> dedicated to processing these heartbeats, which would not get blocked on
>>>> I/O.  Then the clients could configure a second, smaller timeout value, and
>>>> it would be assumed that any such timeout indicated a real problem.  The
>>>> existing mechanism would still be in place to catch I/O-related errors.
>>>>
>>>> I understand the philosophy that there should be some heartbeat mechanism
>>>> that takes the disk into account, but I'm having trouble coming up with
>>>> technical reasons not to add a second mechanism. Obviously, the advantage
>>>> would be that the clients could detect network failures and system crashes
>>>> more quickly in an environment with slow disks, and fail over to other
>>>> servers more quickly.  The only disadvantages I can come up with are:
>>>>
>>>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>>>> 2) I think the servers have to log session expirations to disk, so if the
>>>> sessions expire at a faster rate than the disk can handle, it might lead to
>>>> a large backlog.
>>>>
>>>> Are there other drawbacks I am missing?  Would a patch that added
>>>> something like this be considered, or is it dead from the start? Thanks,
>>>>
>>>> Jeremy
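
For readers following the thread, the two-timeout idea in the original message can be sketched roughly as below. This is only an illustration of the proposal, not ZooKeeper code: the class `DualTimeoutDetector`, its method names, and the status strings are all hypothetical, and the 6-second and 30-second values are simply the figures mentioned in this thread. The sketch assumes the network-only heartbeat is answered by a thread that never blocks on disk I/O, so a missed network heartbeat is treated as a real failure, while the longer session timeout still covers I/O stalls.

```python
import time
from typing import Callable


class DualTimeoutDetector:
    """Hypothetical client-side tracker for two independent heartbeat streams."""

    def __init__(
        self,
        network_timeout: float,
        session_timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Short timeout for the proposed network-only heartbeat (assumed name).
        self.network_timeout = network_timeout
        # Long timeout for the existing, disk-aware session heartbeat.
        self.session_timeout = session_timeout
        # Injectable clock so the logic can be exercised without sleeping.
        self.clock = clock
        now = clock()
        self.last_network_beat = now
        self.last_session_beat = now

    def on_network_heartbeat(self) -> None:
        """Record a reply on the lightweight, non-blocking heartbeat path."""
        self.last_network_beat = self.clock()

    def on_session_heartbeat(self) -> None:
        """Record a reply on the existing session heartbeat path."""
        self.last_session_beat = self.clock()

    def status(self) -> str:
        """Classify the connection based on which heartbeat has gone stale."""
        now = self.clock()
        # The short timeout fires first when the network or server is down.
        if now - self.last_network_beat > self.network_timeout:
            return "network_failure"
        # Network replies still arrive, but the disk-aware path has stalled.
        if now - self.last_session_beat > self.session_timeout:
            return "session_expired"
        return "ok"


if __name__ == "__main__":
    detector = DualTimeoutDetector(network_timeout=6.0, session_timeout=30.0)
    print(detector.status())
```

With this split, a client could fail over to another server as soon as `status()` reports a network failure, instead of waiting for the full session timeout.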