|
|
-
Zookeeper C client I/O thread delay in poll system call
Tabrez Memon 2010-12-09, 07:42
Hi
I am observing the following error in my app based on the C zookeeper client. The app is completely idle at all times and it has just created a connection with the zookeeper server running on the same machine with a session timeout of 5000ms.
2010-12-04 10:14:45,593:8710(0x7f641136d910):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 25317ms 2010-12-04 10:14:45,593:8710(0x7f641136d910):ZOO_ERROR@handle_socket_error_msg@1528: Socket [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 23651ms)
On adding some debug logs, I found that the delay happens in the poll() system call used in the zookeeper I/O thread (mt_zookeeper.c). I am using a session timeout of 5 seconds. Thus poll gets called with a timeout of ~1666 ms. However after running the app for a while, poll gets stuck for a duration much higher than the specified timeout. Here are the logs which illustrate this problem:
2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@303: called zk interest 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@311: calling poll with timeout 1666ms 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@313: out of poll 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@325: calling zookeeper_process 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@327: out of zookeeper_process 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@301: calling zk interest
As you can see the poll call returned after ~26 seconds while it was passed a timeout of 1.6s.
To see if this problem was related to poll() I replaced it with select calls. The app ran for a much longer time after this change., however I eventually hit the same problem again:
2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest@1468: Time since last receive: 0ms 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@do_io@306: calling select with timeout 1666ms 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@308: out of select 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@310: select returned with 0 ready fds 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@327: calling zookeeper_process 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@329: out of zookeeper_process 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest@1465: Exceeded deadline by 2682ms 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest@1468: Time since last receive: 4348ms 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest@1472: Nothing received from server since last: 4348ms 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_ERROR@handle_socket_error_msg@1540: Socket [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 1015ms)
In the above run, select was stuck for ~5 seconds inspite of passing a timeout of 1.6s.
I am running this on Ubuntu 10.10 with kernel version 2.6.35. Has anyone else observed anything like this? I was looking around for any posts related to issues with poll/select on linux, but havent found anything.
Thanks for your help! Regards, Tabrez
-
Re: Zookeeper C client I/O thread delay in poll system call
Henry Robinson 2010-12-20, 12:57
Hi Tabrez -
Just wondering - are the number of file descriptors that are being polled going up rapidly? Are you able to capture that data?
Henry
On 8 December 2010 23:42, Tabrez Memon <[EMAIL PROTECTED]> wrote:
> Hi > > I am observing the following error in my app based on the C zookeeper > client. > The app is completely idle at all times and it has just created a > connection > with the zookeeper server running on the same machine with a session > timeout of > 5000ms. > > 2010-12-04 10:14:45,593:8710(0x7f641136d910):ZOO_WARN@zookeeper_interest > @1461: > Exceeded deadline by 25317ms > 2010-12-04 > 10:14:45,593:8710(0x7f641136d910):ZOO_ERROR@handle_socket_error_msg@1528: > Socket > [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): > connection > timed out (exceeded timeout by 23651ms) > > On adding some debug logs, I found that the delay happens in the poll() > system > call used in the zookeeper I/O thread (mt_zookeeper.c). I am using a > session > timeout of 5 seconds. Thus poll gets called with a timeout of ~1666 ms. > However > after running the app for a while, poll gets stuck for a duration much > higher > than the specified timeout. Here are the logs which illustrate this > problem: > > 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@303: called > zk > interest > 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@311: calling > poll > with timeout 1666ms > 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@313: out of > poll > 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@325: calling > zookeeper_process > 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@327: out of > zookeeper_process > 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@301: calling > zk > interest > > As you can see the poll call returned after ~26 seconds while it was passed > a > timeout of 1.6s. > > To see if this problem was related to poll() I replaced it with select > calls. > The app ran for a much longer time after this change., however I eventually > hit > the same problem again: > > 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest > @1468: > Time since last receive: 0ms > 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@do_io@306: calling > select with timeout 1666ms > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@308: out of > select > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@310: select > returned with 0 ready fds > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@327: calling > zookeeper_process > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@329: out of > zookeeper_process > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest > @1465: > Exceeded deadline by 2682ms > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest > @1468: > Time since last receive: 4348ms > 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest > @1472: > Nothing received from server since last: 4348ms > 2010-12-08 > 10:59:18,065:20624(0x7f74abe18910):ZOO_ERROR@handle_socket_error_msg@1540: > Socket [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): > connection timed out (exceeded timeout by 1015ms) > > In the above run, select was stuck for ~5 seconds inspite of passing a > timeout > of 1.6s. > > I am running this on Ubuntu 10.10 with kernel version 2.6.35. Has anyone > else > observed anything like this? I was looking around for any posts related to > issues with poll/select on linux, but havent found anything. > > Thanks for your help! > Regards, > Tabrez > > > -- Henry Robinson Software Engineer Cloudera 415-994-6679
-
Re: Zookeeper C client I/O thread delay in poll system call
Mahadev Konar 2010-12-21, 01:19
Also, Which OS is this running on?
Thanks mahadev On 12/20/10 4:57 AM, "Henry Robinson" <[EMAIL PROTECTED]> wrote:
> Hi Tabrez - > > Just wondering - are the number of file descriptors that are being polled > going up rapidly? Are you able to capture that data? > > Henry > > On 8 December 2010 23:42, Tabrez Memon <[EMAIL PROTECTED]> wrote: > >> Hi >> >> I am observing the following error in my app based on the C zookeeper >> client. >> The app is completely idle at all times and it has just created a >> connection >> with the zookeeper server running on the same machine with a session >> timeout of >> 5000ms. >> >> 2010-12-04 10:14:45,593:8710(0x7f641136d910):ZOO_WARN@zookeeper_interest >> @1461: >> Exceeded deadline by 25317ms >> 2010-12-04 >> 10:14:45,593:8710(0x7f641136d910):ZOO_ERROR@handle_socket_error_msg@1528: >> Socket >> [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): >> connection >> timed out (exceeded timeout by 23651ms) >> >> On adding some debug logs, I found that the delay happens in the poll() >> system >> call used in the zookeeper I/O thread (mt_zookeeper.c). I am using a >> session >> timeout of 5 seconds. Thus poll gets called with a timeout of ~1666 ms. >> However >> after running the app for a while, poll gets stuck for a duration much >> higher >> than the specified timeout. Here are the logs which illustrate this >> problem: >> >> 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@303: called >> zk >> interest >> 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@311: calling >> poll >> with timeout 1666ms >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@313: out of >> poll >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@325: calling >> zookeeper_process >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@327: out of >> zookeeper_process >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@301: calling >> zk >> interest >> >> As you can see the poll call returned after ~26 seconds while it was passed >> a >> timeout of 1.6s. >> >> To see if this problem was related to poll() I replaced it with select >> calls. >> The app ran for a much longer time after this change., however I eventually >> hit >> the same problem again: >> >> 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest >> @1468: >> Time since last receive: 0ms >> 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@do_io@306: calling >> select with timeout 1666ms >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@308: out of >> select >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@310: select >> returned with 0 ready fds >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@327: calling >> zookeeper_process >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@329: out of >> zookeeper_process >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest >> @1465: >> Exceeded deadline by 2682ms >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest >> @1468: >> Time since last receive: 4348ms >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest >> @1472: >> Nothing received from server since last: 4348ms >> 2010-12-08 >> 10:59:18,065:20624(0x7f74abe18910):ZOO_ERROR@handle_socket_error_msg@1540: >> Socket [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): >> connection timed out (exceeded timeout by 1015ms) >> >> In the above run, select was stuck for ~5 seconds inspite of passing a >> timeout >> of 1.6s. >> >> I am running this on Ubuntu 10.10 with kernel version 2.6.35. Has anyone >> else >> observed anything like this? I was looking around for any posts related to >> issues with poll/select on linux, but havent found anything. >> >> Thanks for your help! >> Regards, >> Tabrez >> >> >> > > > > > -- > Henry Robinson > Software Engineer > Cloudera > 415-994-6679 >
-
Re: Zookeeper C client I/O thread delay in poll system call
Tabrez Memon 2010-12-22, 19:58
Hi Henry, Mahadev
I am running this on Ubuntu 10.10 with kernel version 2.6.35. Also this is a guest virtual machine running via VMPlayer on an Ubuntu 10.04 host machine.
> Just wondering - are the number of file descriptors that are being polled > going up rapidly? Are you able to capture that data? No, There are only 2 fds being polled, the self pipe and the client socket fd thats connected to the zookeeper server.
The problem escalates when the machine is loaded (high cpu/IO utilization). select takes much longer than than the specified timeout and this eventually leads to the zookeeper client session timing out.
Regards, Tabrez
________________________________ From: Mahadev Konar <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tue, 21 December, 2010 6:49:06 Subject: Re: Zookeeper C client I/O thread delay in poll system call
Also, Which OS is this running on?
Thanks mahadev On 12/20/10 4:57 AM, "Henry Robinson" <[EMAIL PROTECTED]> wrote:
> Hi Tabrez - > > Just wondering - are the number of file descriptors that are being polled > going up rapidly? Are you able to capture that data? > > Henry > > On 8 December 2010 23:42, Tabrez Memon <[EMAIL PROTECTED]> wrote: > >> Hi >> >> I am observing the following error in my app based on the C zookeeper >> client. >> The app is completely idle at all times and it has just created a >> connection >> with the zookeeper server running on the same machine with a session >> timeout of >> 5000ms. >> >> 2010-12-04 10:14:45,593:8710(0x7f641136d910):ZOO_WARN@zookeeper_interest >> @1461: >> Exceeded deadline by 25317ms >> 2010-12-04 >> 10:14:45,593:8710(0x7f641136d910):ZOO_ERROR@handle_socket_error_msg@1528: >> Socket >> [127.0.0.1:9876] zk retcode=-7, errno=110(Connection timed out): >> connection >> timed out (exceeded timeout by 23651ms) >> >> On adding some debug logs, I found that the delay happens in the poll() >> system >> call used in the zookeeper I/O thread (mt_zookeeper.c). I am using a >> session >> timeout of 5 seconds. Thus poll gets called with a timeout of ~1666 ms. >> However >> after running the app for a while, poll gets stuck for a duration much >> higher >> than the specified timeout. Here are the logs which illustrate this >> problem: >> >> 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@303: called >> zk >> interest >> 2010-12-04 23:45:50,487:3813(0x7f356fc54910):ZOO_DEBUG@do_io@311: calling >> poll >> with timeout 1666ms >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@313: out of >> poll >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@325: calling >> zookeeper_process >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@327: out of >> zookeeper_process >> 2010-12-04 23:46:16,619:3813(0x7f356fc54910):ZOO_DEBUG@do_io@301: calling >> zk >> interest >> >> As you can see the poll call returned after ~26 seconds while it was passed >> a >> timeout of 1.6s. >> >> To see if this problem was related to poll() I replaced it with select >> calls. >> The app ran for a much longer time after this change., however I eventually >> hit >> the same problem again: >> >> 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest >> @1468: >> Time since last receive: 0ms >> 2010-12-08 10:59:13,718:20624(0x7f74abe18910):ZOO_DEBUG@do_io@306: calling >> select with timeout 1666ms >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@308: out of >> select >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@310: select >> returned with 0 ready fds >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@327: calling >> zookeeper_process >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@do_io@329: out of >> zookeeper_process >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_WARN@zookeeper_interest >> @1465: >> Exceeded deadline by 2682ms >> 2010-12-08 10:59:18,065:20624(0x7f74abe18910):ZOO_DEBUG@zookeeper_interest
|
|