Jonathan Hsieh
2012-12-14, 17:17
Stack
2012-12-14, 20:38
Ted Yu
2012-12-14, 17:37
Jonathan Hsieh
2012-12-14, 18:08
Jesse Yates
2012-12-14, 18:11
-
Online snapshots progress. Jonathan Hsieh 2012-12-14, 17:17
Hey folks,
I've been testing and finding bugs on a branch of online snapshots for the past few days. The good news is that taking an online snapshot seems to be fairly robust -- I've been taking online snapshots as quickly as possible on a 5-node cluster being battered by a performance eval (PE) random write run.

As expected, we ran into some hiccups. In my last run of the PE/online-snapshotting test, it looks like 88/100 snapshots succeeded. This is OK; some failures are actually expected (the first cut only claims better consistency than 'copytable' and 'only-on-a-sunny-day' semantics). From a quick look at what caused the failed cases: if splits or balancing occur while a snapshot is being taken, the region moves cause the final snapshot verification step to abort, because we look for the new regions and don't know if we have all regions. We've also found some problems with splits of hfilelinks (HBASE-7339), and we've encountered occasional failed or hung clone attempts (HBASE-7352) and an occasional ZK-related slow abort. As they are found and characterized, I've been filing them under HBASE-6055 (offline-snapshots) or HBASE-7290 (online-snapshots).

I'm going to switch from bug-fixing mode back to patch-polishing mode today to get some of this committed to the snapshot dev branch. Here's how I hope to deal with them moving forward.

I'll be polishing the pieces I've been testing (there are about 5-7 patches in flight currently) and putting updated pieces up for review. There is non-trivial overhead maintaining this many patches "in the future". Since this is a dev-branch, I'm going to ask that reviews of these initial big dev-branch patches focus on understandability, and that your +1's let us punt to follow-on jiras and TODOs more frequently than if you were reviewing for trunk. The sooner we get the skeleton in, the easier collaboration becomes with other folks working on and testing the same branch. Ideally, getting the large pieces in would make follow-ons easier to review and tackle.
The promise here, of course, is that many of these follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be blockers before merging offline snapshots to trunk and merging online snapshots to trunk.

Sound good?

We initially had one snapshot branch (offline snapshots), but I'm proposing having two: the offline-snapshot branch and the online-snapshot branch. Jesse's been the master of the offline branch and has been pushing dev-branch patches to that branch (https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin pushing dev-branch *reviewed commits* for online-snapshots to another branch. For those following, here's an explanation of how I'm working.

* The latest for-review patches will always be on the review boards.
* Branch-committed portions (patches reviewed and +1'ed for the branch) for online snapshots will live here: https://github.com/jmhsieh/hbase/tree/snapshots. My branch will periodically be force pushed to deal with rebases onto constantly updating trunk, and to include offline-branch committed patches.
* The latest working and consolidated online-snapshot branch (commits correspond to HBASE jiras) will live at https://github.com/jmhsieh/hbase/tree/snapshots-work . This branch is subject to frequent forced pushes. It is a cleanup step done to prep patches for reviews and to match what the eventual commit structure would look like. It also contains some patches that may be abandoned or reordered.
* Rough incremental in-progress branches live here: https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (replace 1213 with the latest date to see where I am). These rough branches have many small commits that focus on functionality and need to be rebased to "sprinkle" edits into the appropriate JIRA-corresponding patches. These branches will rarely if ever be force pushed. These are what I test from, and they are probably suitable for others to use for testing.
I periodically merge this with the snapshots-work branch, mostly as proof that what I have up for review is the same as what I've been testing.

Jon.

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
-
Re: Online snapshots progress. Stack 2012-12-14, 20:38
On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> Sound good?

Yes.

St.Ack
-
Re: Online snapshots progress. Ted Yu 2012-12-14, 17:37
Thanks for the update, Jon.
bq. if splits or balancing occurs while a snapshotting, the region moves cause the final snapshot verification step to abort

The split or balancing happened during the snapshot verification step, right?
-
Re: Online snapshots progress. Jonathan Hsieh 2012-12-14, 18:08
Basically, there are two meta reads -- one to get the list of servers involved, and one after the snapshot is taken to verify that all regions in the snapshot match up with the regions in meta at that point in time.

I believe moves/balances while a snapshot is in progress can cause some RSs to be missed, and that, along with splits, may make new regions appear in meta that do not exist in the just-taken snapshot, causing the snapshot verifier to reject the snapshot.

Jon.

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
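
For readers following along, the two-meta-read check described in this message can be sketched roughly as below. This is only an illustration of the idea, not the HBase snapshot verifier: the class `SnapshotVerifySketch`, its methods, and the region names are all hypothetical, and region identity is reduced to a plain string.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Illustrative sketch only: NOT the actual HBase snapshot verifier.
 * All class, method, and region names here are hypothetical.
 */
final class SnapshotVerifySketch {

    /** Regions listed in meta at verification time that the snapshot lacks. */
    static Set<String> missingFromSnapshot(Set<String> regionsInMeta,
                                           Set<String> regionsInSnapshot) {
        Set<String> missing = new TreeSet<>(regionsInMeta);
        missing.removeAll(regionsInSnapshot);
        return missing;
    }

    /** Accept the snapshot only if every region now in meta was captured. */
    static boolean verify(Set<String> regionsInMeta,
                          Set<String> regionsInSnapshot) {
        return missingFromSnapshot(regionsInMeta, regionsInSnapshot).isEmpty();
    }

    public static void main(String[] args) {
        // First meta read: the regions (and hence servers) asked to snapshot.
        Set<String> firstRead = Set.of("region-a", "region-b");
        // Assume the snapshot captures exactly what the first read saw.
        Set<String> snapshot = new TreeSet<>(firstRead);

        // Quiet cluster: the second meta read is unchanged.
        System.out.println("no split: " + verify(firstRead, snapshot));

        // region-b splits into two daughters before the second meta read,
        // so meta now lists regions the snapshot has never heard of.
        Set<String> secondRead = Set.of("region-a", "region-b1", "region-b2");
        System.out.println("after split: " + verify(secondRead, snapshot));
        System.out.println("unknown regions: "
                + missingFromSnapshot(secondRead, snapshot));
    }
}
```

In this toy model a split between the two reads is enough to make verification fail, which matches the behavior reported for the failed snapshots; how the real verifier compares regions may differ.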
-
Re: Online snapshots progress. Jesse Yates 2012-12-14, 18:11
Loving the extensive testing Jon - good stuff.
> Basically, there are two meta reads -- one to get the list of servers
> involved, and one after the snapshot is taken to verify that all regions
> in the snapshot match up with the regions in meta at that point in time.
>
> I believe moves/balances while a snapshot is in progress can cause some
> RSs to be missed, and that, along with splits, may make new regions
> appear in meta that do not exist in the just-taken snapshot, causing
> the snapshot verifier to reject the snapshot.

Yeah, that's the right intuition, as long as nothing has really changed in the code, from what I remember :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com