Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Zookeeper, mail # user - Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")


+
Brian Tarbox 2013-02-25, 16:51
+
Camille Fournier 2013-02-25, 16:56
+
Brian Tarbox 2013-02-25, 18:05
Copy link to this message
-
Re: Server seems not to be sending keep-alives so I lose my session ("have not heard from server....")
Camille Fournier 2013-02-26, 18:10
I'm not sure that having more pings will help you with a lossy network. The
right thing to do here I think is to increase the timeout if you want to be
able to stay connected through a lossy network, not send even more pings.
I'm not a networking expect but TCP should be retrying the send of those
packets long before you reach the client liveness timeout. I would also
just in general question how much you want your ZK clients thinking they
are alive when the network is so lossy that 2 pings are unable to be acked
in a reasonable period. The control needs to be the timeouts in your
network both at the ZK layer and the tcp layer (if you have access to
that), not the number of pings you send.

C
On Mon, Feb 25, 2013 at 1:05 PM, Brian Tarbox <[EMAIL PROTECTED]> wrote:

> The server logs don't say anything.  I do have a theory based on reading
> the code, specifically the SendThread class within ClientCnxn.java
>
> It took me a while to figure that its the client that sends the ping due to
> the error message being "have not heard from the *server *..."
> Once I got past that the key line in the code is:
>
> int timeToNextPing = readTimeout / 2  - clientCnxnSocket.getIdleSend()
>
> This basically means that the client will get at most 2 tries to send the
> ping within the timeout interval, no matter what you set the timeout value
> to.
> In a lossy network this may be insufficient...as can be seen from my client
> logs where I can go 30 seconds without sending a ping.
>
> I'm running a test now where I've changed the "2" to a "4".  I trade a tiny
> increase in network traffic for a much higher chance of getting a
> successful ping even in a bad network environment.
>
> Brian
>
>
> On Mon, Feb 25, 2013 at 11:56 AM, Camille Fournier <[EMAIL PROTECTED]
> >wrote:
>
> > What do your server logs say during this time?
> >
> >
> > On Mon, Feb 25, 2013 at 11:51 AM, Brian Tarbox <[EMAIL PROTECTED]
> > >wrote:
> >
> > > I am getting the dreaded message:
> > >
> > >  10:59:45,871 INFO [org.apache.zookeeper.ClientCnxn] - <Client session
> > > timed out, have not heard from server in 31482ms for sessionid
> > > 0x13d11dd08160007, closing socket connection and attempting reconnect>
> > >
> > > and from looking at the logs it certainly seems that the keep alive
> > > messages are sometime just not being sent.
> > >
> > > In my case I see a bunch of these:
> > > 10:58:00,164 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:13,511 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:26,857 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:58:40,205 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > > 10:59:14,140 DEBUG [org.apache.zookeeper.ClientCnxn] - <Got ping
> response
> > > for sessionid: 0x13d11dd08160007 after 0ms>
> > >
> > > But then nothing from 10:59:14 until 10:59:45 when my client decides
> its
> > > been too long and so times out.
> > >
> > > I'm running 3.4.5 on EC2 ...any suggestions welcome.
> > >
> > > Thanks.
> > >
> > > Brian Tarbox
> > > --
> > > http://about.me/BrianTarbox
> > >
> >
>
>
>
> --
> http://about.me/BrianTarbox
>