Loving the extensive testing Jon - good stuff.
> Basically, there are two meta reads -- once to get the list of servers
> involved, and once after the snapshot is taken to verify that all regions
> in the snapshot match up with the snapshots in meta at that point in time.
> I believe moves/balances while a snapshot is running will cause some rs's to
> potentially be missed, and that splits may make new regions
> appear in meta that do not exist in a just taken snapshot and thus cause
> the snapshot verifier to reject the snapshot.
Yeah, that's the right intuition, as long as nothing has really changed in
the code, from what I remember :)
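
To make the failure mode concrete, here is a minimal sketch of the check described above. It is a toy model, not the HBase verifier: verify_snapshot and the region names (r1, r2a, ...) are hypothetical, and the only assumption is that verification compares the set of regions captured in the snapshot with the set of regions listed in meta at verification time.

```python
def verify_snapshot(snapshot_regions, meta_regions):
    """Toy model of the final verification step (hypothetical helper).

    Returns (ok, missing, unexpected):
      missing    -- regions listed in meta but absent from the snapshot
      unexpected -- regions in the snapshot that meta no longer lists
    """
    snap = set(snapshot_regions)
    meta = set(meta_regions)
    missing = meta - snap
    unexpected = snap - meta
    ok = not missing and not unexpected
    return ok, sorted(missing), sorted(unexpected)


# First meta read: regions the snapshot is taken against.
regions_at_start = ["r1", "r2", "r3"]

# A split of r2 happens while the snapshot is running, so the
# second meta read (at verification time) lists the daughter regions.
regions_at_verify = ["r1", "r2a", "r2b", "r3"]

ok, missing, unexpected = verify_snapshot(regions_at_start, regions_at_verify)
print(ok, missing, unexpected)
```

In this model a split between the two meta reads is enough to make the verifier reject an otherwise complete snapshot, which matches the behavior described in the thread.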
On Fri, Dec 14, 2012 at 10:08 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> On Fri, Dec 14, 2012 at 9:37 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > Thanks for the update, Jon.
> > bq. if splits or balancing occurs while snapshotting, the region moves
> > cause the final snapshot verification step to abort
> > The split or balancing happened during snapshot verification step, right?
> > On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > > Hey folks,
> > >
> > > I've been testing and finding bugs on a branch of online snapshots for
> > the
> > > past few days. The good news is that taking an online snapshot seems to
> > be
> > > fairly robust -- I've been taking online-snapshots as quickly as
> > > on a 5 node cluster being battered by a performance eval random write
> > run.
> > >
> > >
> > > As expected we ran into some hiccups. In my last run of the
> > > PE/online-snapshotting, it looks like 88/100 snapshots succeeded. This is
> > > ok, some failures are actually expected (the first cut only claims
> > > consistency than 'copytable' and 'only-on-a-sunny-day' semantics).
> From a
> > > quick viewing of what caused the failed cases, if splits or balancing
> > > occurs while snapshotting, the region moves cause the final snapshot
> > > verification step to abort because we look for the new regions and
> > > know if we have all regions. We've also found some problems with
> > of
> > > hfilelinks (HBASE-7339), and we've encountered an occasional
> > > clone attempts (HBASE-7352), and an occasional ZK related slow abort.
> > > they are found and characterized, I've been filing them under
> > > (offline-snapshots) or HBASE-7290 (online-snapshots).
> > >
> > > I'm going to switch from bug fixing mode back to patch polishing mode
> > today
> > > to get some of this committed to the snapshot dev branch. Here's how I
> > > hope to deal with them moving forward.
> > >
> > > I'll be polishing the pieces I've been testing (there are about 5-7
> > patches
> > > in-flight currently) and putting updated pieces up for review. There is
> > > non-trivial overhead maintaining this many patches "in the future".
> > Since
> > > this is a dev-branch, I'm going to ask that these initial big
> > > dev-branch reviews focus on understandability and that your +1's would
> > let
> > > us punt to follow-on jiras and TODOs more frequently than if you were
> > > reviewing for trunk. The sooner we get the skeleton in, the easier
> > > collaboration with other folks working on and testing the same branch will be.
> > > Ideally, getting the large pieces in would allow follow-ons to be
> > > to review and tackle. The promise here, of course, is that many of
> > these
> > > follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be
> > > blockers before merging offline snapshots to trunk and merging online
> > > snapshots to trunk.
> > >
> > > Sound good?
> > >
> > > We've initially had one snapshot branch (offline snapshots) but I'm
> > > proposing having two: the offline-snapshot branch and the
> > > branch. Jesse's been the master of the offline branch and pushing