I'm seeing some odd behavior while establishing zookeeper sessions, most
often exhibited when Jenkins runs through a large batch of unit tests based
around data going in and out of zookeeper. On occasion (and usually in a
burst) I'll see a large number of unit tests fail due to session timeouts.
I haven't been able to pin down a root cause for these long connection
times, so I'm looking for some advice and suggestions going forward.
I'm using the python zookeeper client bindings with a thin wrapper around
them to make the zookeeper client look and feel a bit more like an Object (a
stripped down version of this is linked at the end).
The basic gist of the Object is to connect() and then time out after about
150ms if no session-established event arrives. This code is meant to run
under Flask with HTTP request timeouts at 200ms, so I need to ensure that I
get a connection to zookeeper within that timeframe or return a reasonable
error message back to the caller. Unfortunately, I'm seeing session
establishment times as fast as 3ms and as slow as 370ms (see the fastest and
slowest times linked at end, along with raw data output), which is frequently
causing unit test failures and has occasionally plagued us outside of test
code as well.
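To make the connect-then-timeout pattern concrete, here is a minimal sketch
of the idea. It is not the wrapper linked at the end. The names
connect_with_deadline, start_connect and fake_connect are hypothetical, and
fake_connect only simulates a server that answers after a fixed delay. With
the real bindings, start_connect would presumably wrap zookeeper.init() with
a watcher that fires the callback once the session reports a connected
state.

```python
import threading
import time


def connect_with_deadline(start_connect, timeout=0.150):
    """Start a non-blocking connect and wait up to `timeout` seconds.

    `start_connect` is a hypothetical callable taking one argument, a
    zero-argument `on_connected` callback, which it must invoke once the
    session is established. Returns (connected, elapsed_seconds).
    """
    established = threading.Event()
    started = time.time()
    start_connect(established.set)
    established.wait(timeout)
    # is_set() rather than wait()'s return value, which is None on old Pythons
    return established.is_set(), time.time() - started


def fake_connect(delay):
    """Stand-in for a real client: 'establishes' after `delay` seconds."""
    def start(on_connected):
        timer = threading.Timer(delay, on_connected)
        timer.daemon = True
        timer.start()
    return start


if __name__ == "__main__":
    # One simulated fast session and one that misses the 150ms deadline.
    print(connect_with_deadline(fake_connect(0.003)))
    print(connect_with_deadline(fake_connect(0.370)))
```

The caller gets back both the outcome and the elapsed time, so a slow
session can be logged even when it misses the deadline.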
I've experimented with using eventlet pools to create a pool of good
sessions at program start, but the second time I ran the tests with this
setup the _very_first_session_ timed out on me.
I see no obvious culprits in the logs, and I know for a fact that these
delayed sessions eventually _do_ establish. They just occasionally take an
order of magnitude longer to do so. I suspect that these delays are a
result of the JVM garbage collecting, but that wouldn't reasonably explain
the first-session timeout I ran into.
The test script I used to generate these results is linked at the end.
You'll need the python-zookeeper and python-eventlet python modules
available and a zookeeper instance to hit.
My test system (virt):
My zookeeper config is the stock Ubuntu config for zookeeper with the
client connection count upped to 200. I'm using zookeeper
version 3.3.3+dfsg2-1ubuntu1. Java version is
java version "1.6.0_23"
OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
Test Script with Session object: http://pastebin.com/yxhN0Apk
Sample run of Test Script: http://pastebin.com/LJ1AA1jK
Fastest 15 and slowest 15 session establishments:
Raw times as a python json.dump(): http://pastebin.com/2sV65VSG
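For anyone who wants to slice the raw times themselves, a rough sketch of
the summary follows. It assumes the dump is a flat JSON list of
establishment times in milliseconds, which may not match the actual
pastebin layout. The function name summarize and the sample values in the
usage lines are made up for illustration.

```python
import json


def summarize(times_ms, deadline_ms=150.0, n=15):
    """Return the n fastest and n slowest times and how many missed the deadline."""
    ordered = sorted(times_ms)
    over = [t for t in ordered if t > deadline_ms]
    return {
        "fastest": ordered[:n],
        "slowest": ordered[-n:],
        "over_deadline": len(over),
        "fraction_over": len(over) / float(len(ordered)) if ordered else 0.0,
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the real data set.
    sample = json.loads("[3, 12, 370, 8, 160, 5]")
    print(summarize(sample, n=2))
```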
Thanks in advance for any help and advice.