HMaster not failing over dead RegionServers
Bryan Beaudreault 2012-06-30, 05:04
Tonight in an AWS outage we lost 11 out of 51 regionservers. All HMasters
were unaffected, but the active master kept spamming messages like this:
12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
It was not following through with splitting the HLog files and didn't appear
to be moving regions off the failed hosts. After giving it about 20 minutes to
try to right itself, I tried restarting the service. The restart script
just hung for a while printing dots, and nothing apparent was happening in
the logs at the time. Finally I ran kill -9 on the process so that another
master could take over. The new master seemed to start splitting logs, but
eventually got into the same state of printing the above message.
Eventually it all worked out, but it took WAY too long (almost an hour, all
told). Is this something that is tunable? The dead servers should have been
removed from the list right away instead of being retried so many times. Each
server was retried upwards of 30-40 times.
I am running CDH3u2 (HBase 0.90.4).
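In case it helps anyone reproduce the retry numbers, here is a rough sketch of how the per-server retry count can be tallied from the master log. It assumes every retry message sits on a single line in the format quoted above; the regex is inferred from that one sample, and the function name count_retries is just something made up for this illustration.

```python
import re
from collections import Counter

# Pattern inferred from the single ipc.Client line quoted in this post.
# The optional text before "/" allows for a "hostname/ip:port" form,
# which is an assumption; the sample only shows "/ip:port".
RETRY_RE = re.compile(
    r"Retrying connect to server: \S*?/(\d+\.\d+\.\d+\.\d+:\d+)\. "
    r"Already tried \d+ time\(s\)"
)


def count_retries(lines):
    """Return {"ip:port": number of retry messages seen for that address}."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

Any iterable of lines works, so an open log file can be passed straight in, for example `with open(path) as f: print(count_retries(f))`, where path is wherever the master log lives on your install.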