HBase, mail # dev - All region server died due to "Parent directory doesn't exist"


Re: All region server died due to "Parent directory doesn't exist"
Enis Söztutar 2013-05-10, 05:01
But you see the zookeeper session timeout events in RS logs, and the master
says that zk session for the RS's has expired, right?
On Thu, May 9, 2013 at 9:25 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Still looking. Stack and Himanshu are looking too (thanks again!).
>
> What I do know is that it has to do with the fencing mechanism during log
> splitting.
> Until I bounced HDFS and ZK (ZK probably being the culprit) each started
> RegionServer would immediately be fenced off (its log directory renamed).
> Probably by the SSH.
>
> It is not clear what caused the first RS to die. While there is no direct
> evidence, from the logs it looks like the log directory was just suddenly
> renamed.
>
> I'll spend more time in the logs and also watch for this happening again.
>
> We did find another misconfigured cluster that had some services pointed
> at this cluster. It does not look like that was actually a problem - there
> is no evidence in the logs that this actually caused a problem, but it made
> this deploy somewhat "special".
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Enis Söztutar <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> [EMAIL PROTECTED]>
> Sent: Thursday, May 9, 2013 6:10 PM
> Subject: Re: All region server died due to "Parent directory doesn't exist"
>
>
>
> Were we able to find the root cause?
>
>
>
> On Thu, May 9, 2013 at 11:28 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> >Good news is that as far as I can tell no data was lost.
> >Eventually all logs were split and replayed.
> >
> >
> >
> >-- Lars
> >
> >
> >
> >----- Original Message -----
> >
> >From: lars hofhansl <[EMAIL PROTECTED]>
> >To: HBase Dev List <[EMAIL PROTECTED]>
> >
> >Cc:
> >Sent: Thursday, May 9, 2013 11:13 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >Thanks Stack.
> >
> >I sent the logs.
> >Also, I have since bounced HDFS and ZK and the problem is gone now (I can
> >start RSs again and they stay up). Something got into a weird state.
> >
> >
> >-- Lars
> >
> >
> >
> >________________________________
> >From: Stack <[EMAIL PROTECTED]>
> >To: HBase Dev List <[EMAIL PROTECTED]>; lars hofhansl <
> [EMAIL PROTECTED]>
> >Sent: Thursday, May 9, 2013 10:34 AM
> >Subject: Re: All region server died due to "Parent directory doesn't exist"
> >
> >
> >
> >Want to send me a regionserver log Lars? (off-list)
> >St.Ack
> >
> >
> >
> >On Thu, May 9, 2013 at 10:03 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >>Thanks Ted and Varun.
> >>
> >>
> >>Let me check on the .META. server.
> >>
> >>
> >>The majority (13) of the RSs died within 2 minutes. The remaining 3 died
> >>over the following 10 minutes.
> >>So that would point to a general issue. I did not see any ZK issues but
> >>I'll double check.
> >>
> >>
> >>It is just interesting that even now, if I start an RS it aborts within
> >>a minute or two, because of this issue.
> >>
> >>
> >>-- Lars
> >>
> >>
> >>----- Original Message -----
> >>From: Ted Yu <[EMAIL PROTECTED]>
> >>To: [EMAIL PROTECTED]
> >>
> >>Cc:
> >>Sent: Thursday, May 9, 2013 9:51 AM
> >>Subject: Re: All region server died due to "Parent directory doesn't exist"
> >>
> >>Thanks Varun for sharing your experience.
> >>
> >>Lars:
> >>Was the server carrying .META. functioning properly around the time when
> >>you observed the problem ?
> >>
> >>Cheers
> >>
> >>On Thu, May 9, 2013 at 9:41 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> >>
> >>> I meant no NTP/clock synchronization b/w zookeeper quorum and the HBase
> >>> cluster. I am not sure if you are seeing the exact same issue though. We
> >>> did not have mass failures at the same time due to this.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>>
> >>> On Thu, May 9, 2013 at 9:39 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
> >>>
> >>> > Btw, I am not 100% sure but I have seen something like this before:
> >>> >
> >>> > 1) ZK connection flakiness causes ephemeral nodes to expire