Re: Online snapshots progress.
Jesse Yates 2012-12-14, 18:11
Loving the extensive testing Jon - good stuff.
> Basically, there are two meta reads -- once to get the list of servers
> involved, and once after the snapshot is taken to verify that all regions
> in the snapshot match up with the snapshots in meta at that point in time.
>
> I believe moves/balances when snapshot is going will cause some rs's to
> potentially be missed, and that and splits may make new regions
> appear in meta that do not exist in a just taken snapshot and thus cause
> the snapshot verifier to reject the snapshot.

Yeah, that's the right intuition, as long as nothing has really changed in
the code, from what I remember :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com

On Fri, Dec 14, 2012 at 10:08 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:

> Jon.
>
> On Fri, Dec 14, 2012 at 9:37 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Thanks for the update, Jon.
> >
> > bq. if splits or balancing occurs while snapshotting, the region moves
> > cause the final snapshot verification step to abort
> >
> > The split or balancing happened during snapshot verification step, right?
> >
> > On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> >
> > > Hey folks,
> > >
> > > I've been testing and finding bugs on a branch of online snapshots for the
> > > past few days. The good news is that taking an online snapshot seems to be
> > > fairly robust -- I've been taking online-snapshots as quickly as possible
> > > on a 5 node cluster being battered by a performance eval random write run.
> > >
> > > As expected we ran into some hiccups. In my last run of the
> > > PE/online-snapshotting, it looks like 88/100 snapshots succeeded. This is
> > > ok, some failures are actually expected (the first cut only claims better
> > > consistency than 'copytable' and 'only-on-a-sunny-day' semantics).
> > > From a quick viewing of what caused the failed cases, if splits or balancing
> > > occurs while snapshotting, the region moves cause the final snapshot
> > > verification step to abort because we look for the new regions and don't
> > > know if we have all regions. We've also found some problems with splits of
> > > hfilelinks (HBASE-7339), and we've encountered occasional failed-hang
> > > clone attempts (HBASE-7352), and an occasional ZK related slow abort. As
> > > they are found and characterized, I've been filing them under HBASE-6055
> > > (offline-snapshots) or HBASE-7290 (online-snapshots).
> > >
> > > I'm going to switch from bug fixing mode back to patch polishing mode today
> > > to get some of this committed to the snapshot dev branch. Here's how I
> > > hope to deal with them moving forward.
> > >
> > > I'll be polishing the pieces I've been testing (there are about 5-7 patches
> > > in-flight currently) and putting updated pieces up for review. There is
> > > non-trivial overhead maintaining this many patches "in the future". Since
> > > this is a dev-branch, I'm going to ask that these initial big
> > > dev-branch reviews focus on understandability and that your +1's would let
> > > us punt to follow-on jiras and TODOs more frequently than if you were
> > > reviewing for trunk. The sooner we get the skeleton in, the easier
> > > collaboration with other folks working and testing the same branch.
> > > Ideally, getting the large pieces in would allow follow-ons to be easier
> > > to review and tackle. The promise here, of course, is that many of these
> > > follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be
> > > blockers before merging offline snapshots to trunk and merging online
> > > snapshots to trunk.
> > >
> > > Sound good?
> > >
> > > We've initially had one snapshot branch (offline snapshots) but I'm
> > > proposing having two: the offline-snapshot branch and the online-snapshot
> > > branch. Jesse's been the master of the offline branch and pushing