Re: Online snapshots progress.
Ted Yu 2012-12-14, 17:37
Thanks for the update, Jon.
bq. if splits or balancing occur while snapshotting, the region moves
cause the final snapshot verification step to abort
The split or balancing happened during the snapshot verification step, right?
On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> Hey folks,
> I've been testing and finding bugs on a branch of online snapshots for the
> past few days. The good news is that taking an online snapshot seems to be
> fairly robust -- I've been taking online-snapshots as quickly as possible
> on a 5-node cluster being battered by a performance eval random write run.
> As expected, we ran into some hiccups. In my last PE/online-snapshotting
> run, it looks like 88/100 snapshots succeeded. This is
> OK; some failures are actually expected (the first cut only claims better
> consistency than 'copytable' and 'only-on-a-sunny-day' semantics). From a
> quick look at what caused the failed cases, if splits or balancing
> occur while snapshotting, the region moves cause the final snapshot
> verification step to abort because we look for the new regions and don't
> know if we have all regions. We've also found some problems with splits of
> hfilelinks (HBASE-7339), and we've encountered occasional failed or hung
> clone attempts (HBASE-7352) and an occasional ZK-related slow abort. As
> they are found and characterized, I've been filing them under HBASE-6055
> (offline-snapshots) or HBASE-7290 (online-snapshots).
> I'm going to switch from bug fixing mode back to patch polishing mode today
> to get some of this committed to the snapshot dev branch. Here's how I
> hope to deal with them moving forward.
> I'll be polishing the pieces I've been testing (there are about 5-7 patches
> in-flight currently) and putting updated pieces up for review. There is
> non-trivial overhead maintaining this many patches "in the future". Since
> this is a dev-branch, I'm going to ask that these initial big
> dev-branch reviews focus on understandability and that your +1's let
> us punt to follow-on jiras and TODOs more frequently than if you were
> reviewing for trunk. The sooner we get the skeleton in, the easier
> collaboration will be with other folks working on and testing the same branch.
> Ideally, getting the large pieces in would allow follow-ons to be easier
> to review and tackle. The promise here, of course, is that many of these
> follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be
> blockers before merging offline snapshots to trunk and merging online
> snapshots to trunk.
> Sound good?
> We initially had one snapshot branch (offline snapshots) but I'm
> proposing having two: the offline-snapshot branch and the online-snapshot
> branch. Jesse's been the master of the offline branch and has been pushing
> dev-branch patches to that branch (
> https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin
> pushing dev-branch *reviewed commits* for online-snapshots to another
> branch. For those following, here's an explanation of how I'm working.
> * The latest for-review patches will always be in review boards.
> * Branch-committed portions (patches reviewed and +1'ed for the branch) for
> online snapshots will live here:
> https://github.com/jmhsieh/hbase/tree/snapshots. My branch will
> periodically be force pushed to deal with rebases onto constantly updating
> trunk, and to include offline-branch committed patches.
> * The latest working and consolidated online-snapshot branch (commits
> correspond to HBASE jiras) will live at
> https://github.com/jmhsieh/hbase/tree/snapshots-work . This branch is
> subject to frequent forced pushes. It is a cleanup step done to prep
> patches for reviews and to match what the eventual commit structure would
> look like. It also contains some patches that may be abandoned or reordered.
> * Rough incremental in-progress branches live here,
> https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (change 1213