Re: HMaster not failing over dead RegionServers
Jimmy Xiang 2012-06-30, 15:53
The master could not detect that the region server was dead.
How do you set the zookeeper session timeout?
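
For context: the master learns that a region server has died when that server's ZooKeeper session expires, so a long session timeout delays failover. The setting lives in hbase-site.xml as zookeeper.session.timeout (in milliseconds). Below is a minimal sketch for checking what a config file sets, assuming the usual Hadoop-style <configuration>/<property> layout. The helper name read_property and the 60000 value in the sample are illustrative only, not taken from this cluster.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative sample only; a real check would read the cluster's hbase-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>"""

timeout_ms = read_property(SAMPLE, "zookeeper.session.timeout")
print("zookeeper.session.timeout =", timeout_ms)
```

If the function returns None, the property is not overridden in that file and the default shipped with the HBase version in use applies.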
On Sat, Jun 30, 2012 at 8:09 AM, Stack <[EMAIL PROTECTED]> wrote:
> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> <[EMAIL PROTECTED]> wrote:
>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
> This was one of the servers that went down?
>> It was not following through the splitting of HLog files and didn't appear
>> to be moving regions off failed hosts. After giving it about 20 minutes to
>> try to right itself, I tried restarting the service. The restart script
>> just hung for a while printing dots and nothing apparent was happening on
>> the logs at the time.
> Can we see the log, Bryan?
> You might thread dump when it's hung up the next time, Bryan (it would be
> something for us to do a looksee on).
>> Finally I kill -9 the process, so that another
>> master could take over. The new master seemed to start splitting logs, but
>> eventually got into the same state of printing the above message.
> Do you think it is a particular log?
>> Eventually it all worked out, but it took WAY too long (almost an hour, all
>> said). Is this something that is tunable?
> Have the RS carry fewer WALs? It's a configuration.
>> They should have instantly been
>> removed from the list instead of retrying so many times. Each server was
>> retried upwards of 30-40 times.
> Yeah, that's a bit silly.
> We're working on the MTTR in general. Your logs would be of interest
> to a few of us if it's OK that someone else can take a look.
>> I am running cdh3u2 (0.90.4).