ZooKeeper user mailing list: adding a separate thread to detect network timeouts faster

Jeremy Stribling 2013-09-10, 20:01
Ted Dunning 2013-09-10, 20:31
Re: adding a separate thread to detect network timeouts faster
I mostly agree, but let's assume that a ~5x speedup in detecting those
types of failures is considered significant for some people. Are there
technical reasons that would prevent this idea from working?

On 09/10/2013 01:31 PM, Ted Dunning wrote:
> I don't see the strong value here.  A few failures would be detected more
> quickly, but I am not convinced that this would actually improve
> functionality significantly.
> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>> despite all of the known drawbacks.  Assume we could deploy it such that
>> the ZK servers were all using independent CPUs and storage (though not
>> dedicated disks).  Obviously, the shared disks (shared with other, non-ZK
>> VMs on the same hypervisor) will cause ZK to hit the default session
>> timeout occasionally, so you would need to raise the existing session
>> timeout to something like 30 seconds.
>> I'm curious if there would be any technical drawbacks to adding an
>> additional heartbeat mechanism between the clients and the servers, which
>> would have the goal of detecting network-only failures faster than the
>> existing heartbeat mechanism.  The idea is that there would be a new thread
>> dedicated to processing these heartbeats, which would not get blocked on
>> I/O.  Then the clients could configure a second, smaller timeout value, and
>> it would be assumed that any such timeout indicated a real problem.  The
>> existing mechanism would still be in place to catch I/O-related errors.
>> I understand the philosophy that there should be some heartbeat mechanism
>> that takes the disk into account, but I'm having trouble coming up with
>> technical reasons not to add a second mechanism. Obviously, the advantage
>> would be that the clients could detect network failures and system crashes
>> more quickly in an environment with slow disks, and fail over to other
>> servers more quickly.  The only disadvantages I can come up with are:
>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>> 2) I think the servers have to log session expirations to disk, so if the
>> sessions expire at a faster rate than the disk can handle, it might lead to
>> a large backlog.
>> Are there other drawbacks I am missing?  Would a patch that added
>> something like this be considered, or is it dead from the start?
>> Thanks,
>> Jeremy
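
To make the proposal above concrete, here is a minimal sketch of the two-timeout idea as a client might track it. It is not ZooKeeper code and uses no ZooKeeper API: the class name, method names, and status strings are all hypothetical, and the 6-second fast timeout is an assumed value chosen only because it is one fifth of the 30-second session timeout mentioned in the thread. The sketch also assumes that any reply from the server, including a regular session reply, counts as evidence that the network path works.

```python
from dataclasses import dataclass


@dataclass
class DualTimeoutDetector:
    """Hypothetical client-side tracker for one server connection.

    fast_timeout:    seconds allowed without ANY reply (network-only check,
                     answered by a thread that never waits on disk I/O).
    session_timeout: seconds allowed without a regular session reply
                     (the existing check, which may be delayed by slow disks).
    """
    fast_timeout: float
    session_timeout: float
    last_fast_ack: float = 0.0
    last_session_ack: float = 0.0

    def on_fast_ack(self, now: float) -> None:
        # Reply from the lightweight heartbeat thread.
        self.last_fast_ack = now

    def on_session_ack(self, now: float) -> None:
        # A regular reply also proves the network path works (assumption).
        self.last_session_ack = now
        self.last_fast_ack = now

    def status(self, now: float) -> str:
        # Check the short, network-only timeout first.
        if now - self.last_fast_ack > self.fast_timeout:
            return "network_failure"
        # Fall back to the existing, disk-aware timeout.
        if now - self.last_session_ack > self.session_timeout:
            return "session_timeout"
        return "ok"
```

Under these assumed values, a server whose disk stalls but whose network is healthy keeps answering fast heartbeats and is not reported as failed until the 30-second session timeout passes. A network partition, by contrast, is reported once more than 6 seconds pass without any reply, which is where a roughly 5x faster detection would come from.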