Karthik Ranganathan
2010-03-16, 18:13
tsuna
2010-03-16, 19:04
Ryan Rawson
2010-03-16, 20:17
Stack
2010-03-16, 20:49
Todd Lipcon
2010-03-16, 22:17
Karthik Ranganathan
2010-03-17, 01:04
Kannan Muthukkaruppan
2010-03-17, 01:07
Todd Lipcon
2010-03-17, 01:08
Stack
2010-03-17, 03:59
Dhruba Borthakur
2010-03-17, 04:39
Todd Lipcon
2010-03-17, 05:08
Karthik Ranganathan
2010-03-17, 17:21
Ryan Rawson
2010-03-17, 17:48
Todd Lipcon
2010-03-17, 17:55
Ryan Rawson
2010-03-17, 18:38
Todd Lipcon
2010-03-17, 21:59
Todd Lipcon
2010-03-17, 23:29
-
HBASE-2312 discussion
Karthik Ranganathan 2010-03-16, 18:13
Hey guys,

Just wanted to close on which solution we wanted to pick for this issue - I was thinking about working on this one. There are 3 possibilities here. I have briefly written up the issue and the three solutions below.

Issue:
There is a rare corner case where bad things could happen (i.e. data loss):
1) RS #1 is going to roll its HLog - it has not yet created the new one, and the old one will get no more writes
2) RS #1 enters GC Pause of Death
3) Master lists the HLog files of RS #1 that it has to split as RS #1 is dead, and starts splitting
4) RS #1 wakes up, creates the new HLog (the previous one was rolled) and appends an edit - which is lost

Solution 1:
1) Master detects RS #1 is dead
2) The master renames the /hbase/.logs/<regionserver name> directory to something else (say /hbase/.logs/<regionserver name>-dead)
3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file create fails if the directory doesn't exist. Dhruba tells me this is very doable.
4) RS #1 comes back up and is not able to create the new hlog. It restarts itself.
NOTE: Needs another HDFS API to be supported; Todd wants to avoid this. This API exists in Hadoop 0.21, but is not back-ported to 0.20.

Solution 2:
1) RS #1 has written log.1, log.2, log.3
2) RS #1 is just about to write log.4 and enters gc pause before doing so
3) Master detects RS #1 dead
4) Master sees log.1, log.2, log.3. It then opens log.3 for append and also creates log.4 as a lock
5) RS #1 wakes up and isn't allowed to write to either log.3 or log.4 since HMaster holds both.
NOTE: This changes the log file names, and changes the create mode of the log files from overwrite = true to false. Master needs to create the last log file and open it in append mode to prevent the RS from proceeding. RS will fail if it cannot create the next log file. The number of log files the RS can create will be bound.

Solution 3:
1) Write "intend to roll HLog to new file hlog.N+1" to hlog.N
2) Open hlog.N+1 for append
3) Write "finished rolling" to hlog.N
4) Continue writing to hlog.N+1
NOTE: This requires new types of edits to go into the log file - "intent to roll" and "finished roll". Master has to open the last log file for append first. Also, master has to "chase" log files created by the region server (please see the issue for details) as there is an outside chance of log files rolling when the GC pause happens.

In my opinion, from the perspective of code simplicity, I would rank the solutions as 1 being simplest, then 2, then 3. Since 1 needs another HDFS API, I was thinking that 2 seemed simpler to do and easier to verify correctness.

What are your thoughts?

Thanks
Karthik
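
As an illustration of Solution 1 above, here is a minimal sketch of fencing by rename. It is not HBase or HDFS code: it uses the local filesystem through java.nio.file as a stand-in, relying on the fact that Files.createFile never creates missing parent directories (the behaviour the proposed non-recursive HDFS create would provide). The class name, method names and the "rs1" directory are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Local-filesystem analogy for Solution 1; this is not HBase or HDFS code. */
final class RenameFencingSketch {

    /** Master side: rename the dead region server's log directory to "<name>-dead". */
    static Path fence(Path rsLogDir) throws IOException {
        Path fenced = rsLogDir.resolveSibling(rsLogDir.getFileName() + "-dead");
        return Files.move(rsLogDir, fenced);
    }

    /**
     * Region server side: create the next log file without creating parents.
     * Returns false when the directory is gone, in which case the RS should restart itself.
     */
    static boolean tryCreateNextLog(Path rsLogDir, String name) throws IOException {
        try {
            Files.createFile(rsLogDir.resolve(name)); // fails if the parent is missing
            return true;
        } catch (NoSuchFileException e) {
            return false;
        }
    }

    /** Replays the scenario from the thread; true means the late create was rejected. */
    static boolean lateCreateIsRejected() {
        try {
            Path root = Files.createTempDirectory("hbase-logs-sketch");
            Path rsDir = Files.createDirectory(root.resolve("rs1"));
            tryCreateNextLog(rsDir, "hlog.1");         // normal operation before the pause
            fence(rsDir);                              // master declares the RS dead
            return !tryCreateNextLog(rsDir, "hlog.2"); // RS wakes up and tries to roll
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("late create rejected: " + lateCreateIsRejected());
    }
}
```

The point of the design is that the region server needs no extra check: the create itself fails once the master has renamed the directory.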
-
Re: HBASE-2312 discussion
tsuna 2010-03-16, 19:04
On Tue, Mar 16, 2010 at 10:13 AM, Karthik Ranganathan <[EMAIL PROTECTED]> wrote:
> What are your thoughts?

Why not use ZooKeeper? Each RS should hold a lock in ZK while it's alive. When the RS gets suspended for an extended period of time thanks to the magic of the GC (or for some other reason FWIW), it would lose its lock, at which point the master would notice and clean up the mess. If the RS resumes, it would notice that it lost its own lock and do the right thing (commit suicide or whatever you want).

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
-
Re: HBASE-2312 discussion
Ryan Rawson 2010-03-16, 20:17
The problem is that there is a window between the GC pause ending and the notification from ZK. During this time a regionserver could do things it should not. That is the core of this issue.
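
To make the window described above concrete, here is a small deterministic simulation. It does not use ZooKeeper at all: two boolean fields stand in for what the coordination service knows and what the region server currently believes, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates a region server acting on a stale view of its lease. */
final class StaleLeaseWindowSketch {
    /** What the coordination service knows. */
    private boolean leaseValidOnServer = true;
    /** What the region server believes; only updated when a notification arrives. */
    private boolean leaseValidLocally = true;
    private final List<String> actionsTaken = new ArrayList<>();

    void expireLeaseOnServer() { leaseValidOnServer = false; }

    void deliverNotification() { leaseValidLocally = leaseValidOnServer; }

    /** The RS only consults its local belief before acting. */
    void rollLogIfLeaseLooksValid(String label) {
        if (leaseValidLocally) {
            actionsTaken.add(label);
        }
    }

    static List<String> run() {
        StaleLeaseWindowSketch rs = new StaleLeaseWindowSketch();
        rs.expireLeaseOnServer();                                // happens during the GC pause
        rs.rollLogIfLeaseLooksValid("roll-before-notification"); // RS resumes first
        rs.deliverNotification();                                // expiry finally reaches the RS
        rs.rollLogIfLeaseLooksValid("roll-after-notification");  // refused
        return rs.actionsTaken;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The one action that slips through is taken after the lease was already lost, which is why a lock held only in ZK does not close the hole by itself.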
-
Re: HBASE-2312 discussion
Stack 2010-03-16, 20:49
Karthik:

Thanks for looking into this.

Reading over the issue, you think option #2 "not clean" before Todd proposes changing overwrite to false. Do you still think it so? If not, then option #2 seems straight-forward.

While option #3 is more code, it's attractive in that it's a pattern we might take on to solve other filesystem transitions; e.g. recovering failed compactions. Do you think option #3 harder to verify? The 'chasing logs' would be hard to do up in tests.

Thanks,
St.Ack

P.S. Tsuna, up https://issues.apache.org/jira/browse/HBASE-2238 there is some discussion of why hdfs state changes have to be managed in the filesystem only, of how state can't bridge filesystem and zookeeper.
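
As an illustration of the option #3 pattern discussed above, here is an in-memory sketch of the "intent to roll" and "finished roll" markers. The record names, the class and the three-way classification of a log's tail are assumptions made for this example; the thread does not specify a record format or how the master would interpret it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of Solution 3's roll markers; record names are hypothetical. */
final class RollMarkerSketch {
    enum TailState { NO_ROLL, ROLL_IN_PROGRESS, ROLL_FINISHED }

    /** File name to appended records, standing in for HLog files on HDFS. */
    private final Map<String, List<String>> files = new HashMap<>();

    void append(String file, String record) {
        files.computeIfAbsent(file, f -> new ArrayList<>()).add(record);
    }

    /** Steps 1-3 from the thread; finish=false simulates a pause before step 3. */
    void roll(int n, boolean finish) {
        String current = "hlog." + n;
        String next = "hlog." + (n + 1);
        append(current, "INTENT_TO_ROLL " + next);           // step 1
        files.computeIfAbsent(next, f -> new ArrayList<>()); // step 2: open hlog.N+1
        if (finish) {
            append(current, "FINISHED_ROLL " + next);        // step 3
        }
    }

    /** What a reader of hlog.N can conclude from its last record. */
    TailState tailState(int n) {
        List<String> records = files.getOrDefault("hlog." + n, new ArrayList<>());
        if (records.isEmpty()) {
            return TailState.NO_ROLL;
        }
        String last = records.get(records.size() - 1);
        if (last.startsWith("FINISHED_ROLL")) {
            return TailState.ROLL_FINISHED;
        }
        if (last.startsWith("INTENT_TO_ROLL")) {
            return TailState.ROLL_IN_PROGRESS;
        }
        return TailState.NO_ROLL;
    }

    static String demo() {
        RollMarkerSketch log = new RollMarkerSketch();
        log.append("hlog.1", "EDIT row1");
        TailState before = log.tailState(1);
        log.roll(1, false);                           // RS pauses after opening hlog.2
        TailState paused = log.tailState(1);
        log.append("hlog.1", "FINISHED_ROLL hlog.2"); // RS resumes and completes step 3
        TailState done = log.tailState(1);
        return before + "," + paused + "," + done;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same two-marker shape could in principle bracket any multi-step filesystem transition, which is the reuse the message above has in mind.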
-
Re: HBASE-2312 discussion
Todd Lipcon 2010-03-16, 22:17
On Tue, Mar 16, 2010 at 1:49 PM, Stack <[EMAIL PROTECTED]> wrote:
> Karthik:
>
> Thanks for looking into this.
>
> Reading over the issue, you think option #2 "not clean" before Todd
> proposes changing overwrite to false. Do you still think it so? If
> not, then option #2 seems straight-forward.
>
> While option #3 is more code, it's attractive in that it's a pattern we
> might take on to solve other filesystem transitions; e.g. recovering
> failed compactions. Do you think option #3 harder to verify? The
> 'chasing logs' would be hard to do up in tests.

I think the "chasing logs" thing is actually avoidable pretty easily. I commented on HBASE-2312 with thoughts there.

Regarding option 1, I'm not entirely against the new HDFS API, so if others think it's a good solution we may as well go with it (we're already requiring patched HDFS for sync, so another simple patch isn't a huge deal).

Regarding option 2, not sure what you mean by "The number of log files the RS can create will be bound." -- can you explain?

Stack's point that #3 is a useful pattern for lots of transitions seems very valid to me as well.

Todd Lipcon
Software Engineer, Cloudera
-
RE: HBASE-2312 discussion
Karthik Ranganathan 2010-03-17, 01:04
@Stack: With the overwrite=false, I think option #2 looks fine.

@Todd: By "bound number of log files", I meant that if the latest log file is log.N, the master would try to open log.N+1, log.N+2, log.N+3 etc. until one of them succeeds, and the RS cannot open more log files after that one. So the master "bounds" the number of times the log file is opened.

I am a little nervous about the master backing off on detecting the RS's progress - because the RS has already lost its zk lease. Not sure that if the master backs off, this will allow everything to proceed smoothly. But probably calling sync() on zk makes sense. Will think about this some more.

I too like option #3 because it's a useful pattern, but it was initially much easier to reason about #2. Of course #1 is the easiest either way. Again, let me think about this more.

Thanks
Karthik
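
The "bounding" described above can be sketched as follows. This is a local-filesystem analogy, not HDFS code: Files.createFile plays the role of a create with overwrite = false, since it fails when the file already exists. The part of option #2 where the master opens the last log for append has no local equivalent here and is left out; all names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Local-filesystem analogy for option #2's "create log.N+1 as a lock". */
final class ExclusiveCreateLockSketch {

    /** Create log.n only if it does not exist yet (overwrite = false). */
    static boolean createExclusive(Path dir, int n) throws IOException {
        try {
            Files.createFile(dir.resolve("log." + n));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Master: try log.N+1, log.N+2, ... until one create succeeds; return that number. */
    static int claimNextLog(Path dir, int lastSeen) throws IOException {
        int n = lastSeen + 1;
        while (!createExclusive(dir, n)) {
            n++;
        }
        return n;
    }

    static String demo() {
        try {
            Path dir = Files.createTempDirectory("hlog-lock-sketch");
            for (int n = 1; n <= 3; n++) {
                createExclusive(dir, n);        // RS wrote log.1 .. log.3
            }
            createExclusive(dir, 4);            // RS rolls once more after the master listed the directory
            int claimed = claimNextLog(dir, 3); // master: log.4 is taken, log.5 succeeds
            boolean rsBlocked = !createExclusive(dir, claimed); // RS cannot create log.5
            return "claimed=" + claimed + " rsBlocked=" + rsBlocked;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Once the master's create succeeds, the region server's next roll must fail, so the set of logs the master has to split is closed.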
-
RE: HBASE-2312 discussion
Kannan Muthukkaruppan 2010-03-17, 01:07
Option #1 seems easy to reason about and simple to implement. Can we go ahead with that if there is no major objection?

regards,
Kannan
-
Re: HBASE-2312 discussion
Todd Lipcon 2010-03-17, 01:08
On Tue, Mar 16, 2010 at 6:04 PM, Karthik Ranganathan <[EMAIL PROTECTED]> wrote:
> @Stack: With the overwrite=false, I think option #2 looks fine.
>
> @Todd: By "bound number of log files", I meant that the if the latest log
> file is log.N, the master would try to open log.N+1, log.N+2, log.N+3 etc
> until one of it succeeds and the RS cannot open more log files after that
> one. So the master "bounds" the number of times the log file is opened.
>
> I am a little nervous about the master backing off on detecting the RS's
> progress - because the RS has already lost its zk lease. Not sure that if
> the master backs off, this will allow everything to proceed smoothly. But
> probably calling sync() on zk makes sense. Will think about this some more.
>
> I too like option #3 because it's a useful pattern, but it was initially
> much easier to reason about #2. Of course #1 is the easiest either way.
> Again, let me think about this more.

What do you think about the trick of making the RS do a ZK sync before any meta op? This forces it to take at most one action after it's been terminated.

Todd Lipcon
Software Engineer, Cloudera
-
Re: HBASE-2312 discussion
Stack 2010-03-17, 03:59
On Tue, Mar 16, 2010 at 5:08 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> What do you think about the trick of making the RS do a ZK sync before any
> meta op? This forces it to take at most one action after it's been
> terminated.

... where meta op is open of new WAL log?

How would this work? RS would note in ZK the name of the WAL it's about to open before it did it? If the RS then does a "Juliet" -- i.e. goes into a GC pause death-like coma -- on revival, it'll go to open the WAL but master will have already done so, and so it'll fail?

@Karthik "I am a little nervous about the master backing off on detecting the RS's progress - because the RS has already lost its zk lease."

Yes. The RS will have had its 'shut-yourself-down' flag set on loss-of-lease so is on its way out. It's not going to revive so its logs need recovering.

@Kannan "Option #1 seems easy to reason about and simple to implement. Can we go ahead with that if there is no major objection?"

Fine by me.

St.Ack
-
Re: HBASE-2312 discussion
Dhruba Borthakur 2010-03-17, 04:39
I like Option 1 too. It looks clean in the sense that if the Master renames the directory, then the old region server can never write any new log files in that directory. It is similar to the IO-fencing methods used by traditional cluster services: http://en.wikipedia.org/wiki/Fencing_(computing).

thanks,
dhruba
-
Re: HBASE-2312 discussion
Todd Lipcon 2010-03-17, 05:08
On Tue, Mar 16, 2010 at 8:59 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Tue, Mar 16, 2010 at 5:08 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> > What do you think about the trick of making the RS do a ZK sync before any
> > meta op? This forces it to take at most one action after it's been
> > terminated.
>
> ... where meta op is open of new WAL log?
>
> How would this work? RS would note in ZK the name of the WAL it's
> about to open before it did it? If the RS then does a "Juliet" --

[haha, love this terminology!]

> i.e. goes into a GC pause death-like coma -- on revival, it'll go to
> open the WAL but master will have already done so, and so it'll fail?

I was actually referring to the explicit sync call in ZK:
http://hadoop.apache.org/zookeeper/docs/r3.2.1/api/org/apache/zookeeper/ZooKeeper.html#sync%28java.lang.String,%20org.apache.zookeeper.AsyncCallback.VoidCallback,%20java.lang.Object%29

The javadoc isn't that clear, but the way I understand this call is that it makes sure the client's view of the world is up-to-date with respect to the ZK leader at the beginning of the sync call. The "note" box at the bottom of this section also explains it pretty well:
http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperProgrammers.html#ch_zkGuarantees

If we insert this between any transitions, I think we can ensure that the region server will only do at most one operation after losing its lease. This means that the whole "chasing the log" thing is unnecessary.

> @Karthik "I am a little nervous about the master backing off on
> detecting the RS's progress - because the RS has already lost its zk
> lease."
>
> Yes. The RS will have had its 'shut-yourself-down' flag set on
> loss-of-lease so is on its way out. It's not going to revive so its
> logs need recovering.
>
> @Kannan "Option #1 seems easy to reason about and simple to implement.
> Can we go ahead with that if there is no major objection?"
>
> Fine by me.

Fine by me as well. I think we'll need solutions like 2 or 3 in other places, but for this one #1 seems to work (I'll continue to think about whether there are any holes in our logic).

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
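
The "at most one operation after losing its lease" argument above can be checked with a small simulation. It deliberately does not call ZooKeeper: a single assignment stands in for "sync, then re-check the lease", and the method, its parameters and the assumption that no expiry notification arrives during the run are all made up for this sketch.

```java
/** Simulates a region server that refreshes its lease view before each transition. */
final class SyncBeforeTransitionSketch {

    /**
     * Counts the transitions performed after the lease is lost.
     * The lease is lost just after the check that precedes transition number
     * expireAt (0-based), i.e. between the check and the transition itself.
     */
    static int transitionsAfterExpiry(int planned, int expireAt, boolean syncFirst) {
        boolean leaseValidOnServer = true;
        boolean cachedView = true; // local belief; stays stale unless refreshed
        int afterExpiry = 0;
        for (int i = 0; i < planned; i++) {
            if (syncFirst) {
                cachedView = leaseValidOnServer; // stand-in for a ZK sync plus lease check
            }
            if (!cachedView) {
                break; // RS notices the lost lease and stops
            }
            if (i == expireAt) {
                leaseValidOnServer = false; // GC pause: lease lost before the transition runs
            }
            if (!leaseValidOnServer) {
                afterExpiry++; // this transition ran without a valid lease
            }
        }
        return afterExpiry;
    }

    public static void main(String[] args) {
        System.out.println("with sync:    " + transitionsAfterExpiry(5, 2, true));
        System.out.println("without sync: " + transitionsAfterExpiry(5, 2, false));
    }
}
```

With the refresh in place, only the transition already past its check can still run; without it, every remaining transition runs on the stale view.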
-
RE: HBASE-2312 discussionKarthik Ranganathan 2010-03-17, 17:21
Loved the "Juliet" terminology as well :).
@Todd: I agree we will need something like #2 or especially #3 in other places.

Looks like we have a consensus - I will update the JIRA.

Thanks
Karthik
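Since the thread settles on Option #1, a small sketch of its fencing idea may help. It uses the local filesystem as a stand-in for HDFS, which is an assumption: Python's `open()` never creates missing parent directories, which matches the non-recursive create that Option #1 asks HDFS to support. The function names `create_log_strict` and `fence_dead_server` and the `rs1` directory are hypothetical.

```python
import os
import tempfile


def create_log_strict(log_dir, name):
    """Region server side: create a new log file, failing if log_dir is gone."""
    path = os.path.join(log_dir, name)
    # Mode "x" refuses to overwrite an existing file, and open() does not
    # create missing parent directories, so a renamed directory means failure.
    with open(path, "x"):
        pass
    return path


def fence_dead_server(log_dir):
    """Master side: rename the dead server's log directory before splitting."""
    dead_dir = log_dir + "-dead"
    os.rename(log_dir, dead_dir)
    return dead_dir


def demo():
    root = tempfile.mkdtemp()
    log_dir = os.path.join(root, "rs1")
    os.mkdir(log_dir)
    create_log_strict(log_dir, "log.1")

    # Master declares the server dead and fences its logs.
    dead_dir = fence_dead_server(log_dir)

    # The server wakes up from its pause and tries to roll to a new log.
    try:
        create_log_strict(log_dir, "log.2")
        revived_write_succeeded = True
    except FileNotFoundError:
        revived_write_succeeded = False

    return sorted(os.listdir(dead_dir)), revived_write_succeeded
```

After the rename, the master sees only the logs that existed at fencing time, and the revived server's create fails instead of silently recreating the directory, which is the point at which the RS would restart itself.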
-
Re: RE: HBASE-2312 discussionRyan Rawson 2010-03-17, 17:48
I have a 4th option :-) I'm on the bus right now and I'll write it up when I
get to work. In short: move the ZK thread out of the RS into a monitoring parent, and then you can explicitly monitor for Juliet GC pauses. More to come....
-
Re: RE: HBASE-2312 discussionTodd Lipcon 2010-03-17, 17:55
On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
> I have a 4th option :-) I'm on the bus right now and I'll write it up when I
> get to work. In short move the zk thread out of the rs into a monitoring
> parent and then you can explicitly monitor for Juliet gc pauses. More to
> come....
>

I don't think that will be correct - it might be mostly correct, but "Juliet gc pauses" are just an extra long version of what happens all the time. ZK is asynchronous, so we will never find out immediately if we've been killed. There can always be an arbitrarily long pause in between looking at ZK state and taking an action.

-Todd
-
Re: RE: HBASE-2312 discussionRyan Rawson 2010-03-17, 18:38
There are 2 ways to lose your ZK session:
- you don't send pings back to ZK and it expires the session (GC pause of death, network disconnect, etc.)
- ZK "somehow" expires your session for you. I have seen this once in a while; it's rare, but painful when it happens. It didn't seem to be correlated with a GC pause at the time.

So here is the proposal in full:
- RegionServerWatcher (RSW) starts the ZK pingback, and exists to listen for termination notifications from the RegionServer (via good old-fashioned OS primitives).
- RSW keeps the ZK node up. It keeps tabs on its child, perhaps by checking ports or whatnot.
- If the RS dies, RSW deletes the ZK ephemeral node. No race condition, because the log append terminates before the master takes action (which it does only after the ZK notification comes in).
- If an RS goes into a long GC pause, the RSW can decide to wait it out or kill -9 the RS and release the HLog. Again no race condition, for the same reason.
- If a network outage takes the node out, this is where a race condition could occur. In that case, Option #1 seems super clean and awesome. It also has the advantage of being really easy to understand (always a plus at 2am).

The overall advantage of my proposal is that we can tune the ZK timeout down to something really small, like 10 seconds. That way, when network events take a node out of service, we can detect and respond much faster. Also, with a separate process we gain the ability to react instantly to crashes without waiting for a timeout. A disadvantage is more moving parts, but we can probably abstract this away cleanly.

One last thought - if we have a 10 second timeout and we have a network partition, we will see a cascade of failed regionservers. Considering that the individual RS may not be able to proceed anyway (it might have been cut off from too many datanodes to log or read hfiles), this might be inevitable. Obviously this means running HBase across a WAN is right out (we always knew that, right?), but this is why we are doing replication.
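The watcher's decision rules from the proposal above can be sketched as a small pure function. This is only an illustration of the rules as stated in the email: the `Action` names, the `decide` function and its parameters (`child_running`, `seconds_unresponsive`, `max_pause_seconds`) are hypothetical, and how the watcher would actually detect an unresponsive child (ports, OS signals) is left out.

```python
from enum import Enum


class Action(Enum):
    NONE = "child healthy, keep the ZK node up"
    WAIT = "child paused, wait it out"
    KILL_AND_RELEASE = "kill -9 the child, then delete the ephemeral node"
    RELEASE = "child already dead, delete the ephemeral node"


def decide(child_running, seconds_unresponsive, max_pause_seconds):
    """Pick the watcher's next action from what it observes about its child."""
    if not child_running:
        # The child can no longer append to its log, so releasing is safe.
        return Action.RELEASE
    if seconds_unresponsive <= 0:
        return Action.NONE
    if seconds_unresponsive <= max_pause_seconds:
        # Tolerate a GC pause up to the configured limit.
        return Action.WAIT
    # Kill first so the log append has stopped before the master is notified.
    return Action.KILL_AND_RELEASE
```

For example, `decide(True, 45, 30)` returns `Action.KILL_AND_RELEASE`. The network-outage case is deliberately absent: the watcher and its child are cut off together, which is why the proposal still relies on Option #1 there.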
-
Re: RE: HBASE-2312 discussionTodd Lipcon 2010-03-17, 21:59
Hi Ryan,
I think the idea of a secondary watchdog node is a decent one, but as you mentioned, it isn't a solution for the problem at hand. The GC pause exacerbates the problem, but network blips, etc., can cause the same problem.

Is there a JIRA open for the watchdog process? I think we should discuss it separately. A few weeks ago I proposed on IRC the ridiculously named "SeppukuNode", which is a similar but not quite the same idea - we should hash those out on JIRA.

-Todd
-
Re: RE: HBASE-2312 discussionTodd Lipcon 2010-03-17, 23:29
I opened HBASE-2342 to discuss the watchdog node concept.
-Todd