Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Zookeeper, mail # user - watchers not fired after a disk failure?


Copy link to this message
-
watchers not fired after a disk failure?
Jeremy Stribling 2012-03-05, 22:47
I have been investigating an issue at one of our customers (our product
embeds Zookeeper in it), and here's a summary of what I've been able to
discern:

* One of the servers in a 3-server ZK cluster (ZK 3.3.3 with some
patches, C client) experiences a hardware/firmware failure of its RAID
partition (possibly affecting the network card as well).  From
/var/log/syslog:

-------------------------------
Feb 24 01:53:54 controller3 kernel: [881855.148384] megaraid_sas
0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this
device.  Contact the card vendor for a firmware update.
Feb 24 01:53:54 controller3 kernel: [1398904.161148] bnx2 0000:01:00.0:
irq 73 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161157] bnx2 0000:01:00.0:
irq 74 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161165] bnx2 0000:01:00.0:
irq 75 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161173] bnx2 0000:01:00.0:
irq 76 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161181] bnx2 0000:01:00.0:
irq 77 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161188] bnx2 0000:01:00.0:
irq 78 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161196] bnx2 0000:01:00.0:
irq 79 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161203] bnx2 0000:01:00.0:
irq 80 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.161210] bnx2 0000:01:00.0:
irq 81 for MSI/MSI-X
Feb 24 01:53:54 controller3 kernel: [1398904.241931] bnx2 0000:01:00.0:
eth0: using MSIX
Feb 24 01:53:54 controller3 kernel: [1398904.243302]
ADDRCONF(NETDEV_UP): eth0: link is not ready
Feb 24 01:53:57 controller3 kernel: [1398907.309739] bnx2 0000:01:00.0:
eth0: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit
flow control ON
Feb 24 01:53:57 controller3 kernel: [1398907.311239]
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Feb 24 01:55:06 controller3 ntpdate[10764]: step time server
72.3.128.241 offset 63.134559 sec
Feb 24 01:55:06 controller3 collectd[1691]: uc_update: Value too old:
name = controller/disk-sda/disk_octets; value time = 1330048506; last
cache update = 1330048506;
Feb 24 01:55:06 controller3 collectd[1691]: Filter subsystem: Built-in
target `write': Dispatching value to all write plugins failed with
status -1.
Feb 24 01:55:06 controller3 collectd[1691]: uc_update: Value too old:
name = controller/disk-sda/disk_ops; value time = 1330048506; last cache
update = 1330048506;
Feb 24 01:55:06 controller3 collectd[1691]: Filter subsystem: Built-in
target `write': Dispatching value to all write plugins failed with
status -1.
...
Feb 24 01:55:06 controller3 collectd[1691]: Filter subsystem: Built-in
target `write': Dispatching value to all write plugins failed with
status -1.
Feb 24 01:55:10 controller3 kernel: [1398917.517473] eth0: no IPv6
routers present
Feb 24 01:55:56 controller3 kernel: [1398962.889915] bnx2 0000:01:00.0:
irq 73 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889924] bnx2 0000:01:00.0:
irq 74 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889932] bnx2 0000:01:00.0:
irq 75 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889939] bnx2 0000:01:00.0:
irq 76 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889946] bnx2 0000:01:00.0:
irq 77 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889953] bnx2 0000:01:00.0:
irq 78 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889960] bnx2 0000:01:00.0:
irq 79 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889968] bnx2 0000:01:00.0:
irq 80 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.889975] bnx2 0000:01:00.0:
irq 81 for MSI/MSI-X
Feb 24 01:55:56 controller3 kernel: [1398962.970592] bnx2 0000:01:00.0:
eth0: using MSIX
Feb 24 01:55:56 controller3 kernel: [1398962.971959]
ADDRCONF(NETDEV_UP): eth0: link is not ready
Feb 24 01:55:59 controller3 kernel: [1398966.028629] bnx2 0000:01:00.0:
eth0: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit
flow control ON
Feb 24 01:55:59 controller3 kernel: [1398966.030132]
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Feb 24 01:56:05 controller3 ntpdate[10968]: step time server
72.3.128.241 offset 0.002667 sec

* This leads existing ZK client sessions to expire, and those clients
attempt to make new sessions.
* One client (on a different node) establishes a new session to this
server at 01:55:56 (coinciding exactly with the second set of IRQ
messages shown above):

2012-02-24
01:55:56,964:17928(0x7fbf0b077700):ZOO_INFO@check_events@1632: session
establishment complete on server [10.14.237.70:2888],
sessionId=0xd235ad12815a0000, negotiated timeout=6000

* This session is able to successfully create ephemeral znodes, but
watches never fire for the session.  For example, when the session has a
children watch set on /election/a, and then creates
/election/a/a_00000001, its watch on /election/a never fires (but it
does fire for sessions coming from other clients).

Unfortunately I cannot reproduce this, and we don't have good logs.  
Because of this, I cannot definitively place the blame on Zookeeper --
there are layers of our product above ZK that could theoretically be the
problem.  But given the evidence I've looked at, ZK seems to be the
likely culprit.

So my question is: is anyone aware of a failure scenario that would
cause watches to not be fired, but allow other operations from the same
session to be successful?  And are there any workarounds?  I don't
expect much help on this from the community, given the lack of logs and
evidence, but I just wanted to throw it out there and see if anyone had
any ideas off the top of their head.  Thanks,

Jeremy