Re: Online snapshots progress.
Jonathan Hsieh 2012-12-14, 18:08
Basically, there are two meta reads -- one to get the list of servers
involved, and one after the snapshot is taken to verify that all regions
in the snapshot match up with the regions in meta at that point in time.
I believe moves/balances while a snapshot is in progress will cause some
RSs to potentially be missed, and that, along with splits, may make new
regions appear in meta that do not exist in a just-taken snapshot and thus
cause the snapshot verifier to reject the snapshot.
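
To make the failure mode concrete, here is a rough sketch of the two-read
flow described above. It is a simplified model, not the actual HBase code:
the function names, the dict-of-region-to-server view of meta, and the
plain set comparison in the verifier are all assumptions for illustration.

```python
# Hypothetical model of the two meta reads; none of these names are HBase APIs.
# "meta" is modeled as a dict mapping region name -> hosting region server.

def servers_to_contact(meta_before):
    """First meta read: collect the servers hosting the table's regions."""
    return set(meta_before.values())

def take_snapshot(contacted, assignment_during):
    """Each contacted server snapshots whatever regions it hosts at that moment."""
    return {region for region, server in assignment_during.items()
            if server in contacted}

def verify(snapshot_regions, meta_after):
    """Second meta read: the snapshot must cover exactly the regions in meta."""
    expected = set(meta_after)
    missing = expected - snapshot_regions   # in meta, absent from the snapshot
    extra = snapshot_regions - expected     # in the snapshot, gone from meta
    return (not missing and not extra), missing, extra
```

With meta_before = {"r1": "rs1", "r2": "rs2"}, a move of r2 to a server
that was not in the first read leaves r2 out of the snapshot, and a split
of r2 into two daughters leaves the daughters in `missing` and r2 in
`extra`. Either way this model's verify() returns False, which is the
rejection described above.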
On Fri, Dec 14, 2012 at 9:37 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Thanks for the update, Jon.
> bq. if splits or balancing occur while snapshotting, the region moves
> cause the final snapshot verification step to abort
> The split or balancing happened during the snapshot verification step, right?
> On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > Hey folks,
> > I've been testing and finding bugs on a branch of online snapshots for the
> > past few days. The good news is that taking an online snapshot seems to be
> > fairly robust -- I've been taking online snapshots as quickly as possible
> > on a 5 node cluster being battered by a performance eval random write
> > As expected we ran into some hiccups. In my last run of the
> > PE/online-snapshotting, it looks like 88/100 snapshots succeeded. This is
> > ok, some failures are actually expected (the first cut only claims better
> > consistency than 'copytable' and 'only-on-a-sunny-day' semantics). From a
> > quick viewing of what caused the failed cases, if splits or balancing
> > occur while snapshotting, the region moves cause the final snapshot
> > verification step to abort because we look for the new regions and don't
> > know if we have all regions. We've also found some problems with splits and
> > hfilelinks (HBASE-7339), and we've encountered occasional failed or hung
> > clone attempts (HBASE-7352), and an occasional ZK related slow abort. As
> > they are found and characterized, I've been filing them under HBASE-6055
> > (offline-snapshots) or HBASE-7290 (online-snapshots).
> > I'm going to switch from bug fixing mode back to patch polishing mode
> > to get some of this committed to the snapshot dev branch. Here's how I
> > hope to deal with them moving forward.
> > I'll be polishing the pieces I've been testing (there are about 5-7
> > in-flight currently) and putting updated pieces up for review. There is
> > non-trivial overhead maintaining this many patches "in the future". Since
> > this is a dev-branch, I'm going to ask that reviews of these initial big
> > dev-branch patches focus on understandability and that your +1's would let
> > us punt to follow-on jiras and TODOs more frequently than if you were
> > reviewing for trunk. The sooner we get the skeleton in, the easier
> > collaboration is with other folks working on and testing the same branch.
> > Ideally, getting the large pieces in would allow follow-ons to be easier
> > to review and tackle. The promise here, of course, is that many of the
> > follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be
> > blockers before merging offline snapshots to trunk and merging online
> > snapshots to trunk.
> > Sound good?
> > We've initially had one snapshot branch (offline snapshots) but I'm
> > proposing having two: the offline-snapshot branch and the online-snapshot
> > branch. Jesse's been the master of the offline branch and pushing
> > dev-branch patches to that branch (
> > https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin
> > pushing dev-branch *reviewed commits* for online-snapshots to another
> > branch. For those following, here's an explanation of how I'm working.
> > * The latest for-review patches will always be on Review Board.
> > * Branch-committed portions (reviewed and +1'ed for the branch patches) of
> > online snapshots will live here
> > https://github.com/jmhsieh/hbase/tree/snapshots. My branch will
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]