Re: odd network failure mode
Benjamin Reed 2011-02-26, 00:26
it would be an interesting experiment to do.
if we are talking client/server communication, the client should switch to
another server as soon as one side tries to send a big packet. the reason
being that we do not heartbeat if there is an outstanding request, so the
connection would time out and the client would try reconnecting to another
(hopefully better configured) system.
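to make the reasoning above concrete, here is a toy model of that client behavior. it is only a sketch, not ZooKeeper code: `Server`, `max_inbound` and `pick_server` are hypothetical names, and the packet sizes are made up. it assumes a request larger than what a server can receive is silently dropped, and that the client sends no heartbeats while that request is outstanding, so the only thing it ever sees is a timeout.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    max_inbound: int  # hypothetical: largest packet (bytes) this host can receive


def pick_server(servers, request_size):
    """Walk the server list the way a client would after each timeout.

    Returns (name of the server that answered, or None; list of events).
    """
    events = []
    for server in servers:
        if request_size > server.max_inbound:
            # The request is dropped on the way in. Because a request is
            # outstanding, no heartbeat is sent, no reply arrives, and the
            # connection times out. The client then moves to the next server.
            events.append(f"{server.name}: request lost, timeout")
            continue
        events.append(f"{server.name}: request served")
        return server.name, events
    return None, events


# Assumed sizes: one host limited to 1500-byte packets, one accepting 9000.
servers = [Server("zk1", 1500), Server("zk2", 9000)]
chosen, events = pick_server(servers, request_size=4000)
print(chosen)
print(events)
```

in this model small requests still succeed against the misconfigured host, so the client only moves away once a big packet is actually sent.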
if the two machines are servers, the answer isn't as clear. i think the main
"problem" is that the leader election is all made up of small packets, so if
the leader is dropping big packets from followers, it could keep getting
elected, but never sync with followers. we would need to test this case
though to get a definitive answer.
one thing is throwing me though. since we use TCP there is an initial
negotiation of MSS during the handshake. i would think that TCP's MSS
calculation in this case would cause an MSS based on the minimum of the two
MTUs to be selected, which would avoid the problem altogether.
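the arithmetic behind that expectation can be sketched as follows. this assumes IPv4 and TCP headers of 20 bytes each with no options, that each side advertises an MSS derived from its own interface MTU, and that the smaller machine's MTU is 1500 (the original report only says "much smaller", so that figure is an assumption). the function names are hypothetical.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options


def advertised_mss(mtu):
    """MSS a host would announce in its SYN, derived from its own MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def effective_mss(local_mtu, peer_mtu):
    """Segment size actually used: the smaller of the two announced values."""
    return min(advertised_mss(local_mtu), advertised_mss(peer_mtu))


def largest_packet(local_mtu, peer_mtu):
    """Largest IP packet either side would send on this connection."""
    return effective_mss(local_mtu, peer_mtu) + IP_HEADER + TCP_HEADER


# Network at MTU 9000, one machine assumed to be set to 1500.
print(effective_mss(9000, 1500))   # 1460
print(largest_packet(9000, 1500))  # 1500
```

under these assumptions no TCP packet on the connection exceeds the smaller MTU, which is why the failure would not be expected from the MTU mismatch alone.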
On Fri, Feb 25, 2011 at 3:52 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I just saw a novel failure mode the other day that might have interesting
> implications for Zookeeper.
> The problem was that the network MTU was 9000 while one machine's MTU was
> set much smaller. This meant that large incoming packets were dropped but
> all outgoing packets were OK.
> There were other problems with the networking in the real case as well, but
> for a thought experiment this is enough.
> The horrible implication of this failure is that a typical heart-beat or
> are-you-ok request will succeed while a typical content request will fail.
> This leads to a situation where hosts appear to be healthy, but they can't
> actually do anything.
> What will ZK do in this case?