Re: HMaster not failing over dead RegionServers
Bryan Beaudreault 2012-07-03, 00:17
Thanks a bunch for the insight. This message was actually coming from
master, but it still needs to grab the HLog files from hdfs, so I can still
see it being what you mentioned. I'm going to look into tuning these
parameters down in preparation for future failures.
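For anyone estimating how far to tune these down: a rough sketch of how long the NameNode waits before declaring a datanode dead. It assumes the Hadoop 0.20-era rule of 2 x heartbeat.recheck.interval + 10 x dfs.heartbeat.interval, with assumed defaults of 5 minutes and 3 seconds. The function name is made up for illustration, and the formula and defaults should be checked against the Hadoop version in use.

```python
def datanode_expiry_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Estimate how long the NameNode waits before marking a datanode dead.

    Assumed rule (not taken from this thread):
        2 * heartbeat.recheck.interval (ms) + 10 * dfs.heartbeat.interval (s)
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


if __name__ == "__main__":
    # Assumed defaults: 5 minute recheck interval, 3 second heartbeat.
    print(datanode_expiry_ms() / 60_000, "minutes with assumed defaults")
    # Hypothetical tuned value: 30 second recheck interval.
    print(datanode_expiry_ms(recheck_interval_ms=30_000) / 1000, "seconds when tuned down")
```

Under those assumptions the defaults come to about 10.5 minutes, somewhat below the 15 minutes quoted below, so the real window on a given cluster may depend on other settings as well.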
On Mon, Jul 2, 2012 at 7:56 PM, Suraj Varma <[EMAIL PROTECTED]> wrote:
> This looks like it is trying to reach a datanode ... doesn't it?
> > 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
> Is this from a master log or from a region server log? (I'm guessing the
> above is from a region server log while trying to replay hlogs)
> Some time back, we had a similar symptom (HLog splitting takes a long
> time due to the retries) and found that even though the datanode died,
> it was not being detected by the namenode. This leads to the region
> server retrying over dead datanodes over and over stretching out the
> splitting process.
> See this thread:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg10033.html
> We found that by default, it takes 15 mins for a datanode death to be
> detected by a NN ... and this seems to cause the NN serving back the
> dead DN as a valid one when RS tries to read the hlogs.
> The parameters in question are: dfs.heartbeat.recheck.interval and
> heartbeat.recheck.interval ... tweaking this down caused the recovery
> to be much faster.
> Also - hbase.rpc.timeout and zookeeper.session.timeout are two other
> configurations that need to be tweaked down from defaults for quick
> recovery.
> Not sure if this is the case in your error - but, might be something
> to investigate ...
> On Sat, Jun 30, 2012 at 8:53 AM, Jimmy Xiang <[EMAIL PROTECTED]> wrote:
> > Bryan,
> > The master could not detect if the region server is dead.
> > How do you set the zookeeper session timeout?
> > Thanks,
> > Jimmy
> > On Sat, Jun 30, 2012 at 8:09 AM, Stack <[EMAIL PROTECTED]> wrote:
> >> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> >> <[EMAIL PROTECTED]> wrote:
> >>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
> >> This was one of the servers that went down?
> >>> It was not following through the splitting of HLog files and didn't seem
> >>> to be moving regions off failed hosts. After giving it about 20 minutes to
> >>> try to right itself, I tried restarting the service. The restart
> >>> just hung for a while printing dots and nothing apparent was happening in
> >>> the logs at the time.
> >> Can we see the log Bryan?
> >> You might thread dump when it's hung up the next time Bryan (Would be
> >> something for us to do a looksee on).
> >>> Finally I kill -9 the process, so that another
> >>> master could take over. The new master seemed to start splitting logs, but
> >>> eventually got into the same state of printing the above message.
> >> You think it a particular log?
> >>> Eventually it all worked out, but it took WAY too long (almost an hour, all
> >>> said). Is this something that is tunable?
> >> Have RS carry fewer WALs? It's a configuration.
> >>> They should have instantly been
> >>> removed from the list instead of retrying so many times. Each server
> >>> retried upwards of 30-40 times.
> >> Yeah, that's a bit silly.
> >> We're working on the MTTR in general. Your logs would be of interest
> >> to a few of us if it's ok that someone else can take a look.
> >> St.Ack
> >>> I am running cdh3u2 (0.90.4).
> >>> Thanks,
> >>> Bryan