I've been testing and finding bugs on a branch of online snapshots for the
past few days. The good news is that taking an online snapshot seems to be
fairly robust -- I've been taking online snapshots as quickly as possible
on a 5-node cluster being battered by a performance eval (PE) random-write run.
As expected, we ran into some hiccups. In my last PE/online-snapshot run,
it looks like 88 of 100 snapshots succeeded. This is OK; some failures are
actually expected (the first cut only claims better consistency than
'copytable' and 'only-on-a-sunny-day' semantics). From a quick look at what
caused the failures: if a split or balancing occurs while a snapshot is
being taken, the region moves cause the final snapshot verification step to
abort, because we look for the new regions and don't know if we have all
regions. We've also found some problems with splits of hfilelinks
(HBASE-7339), occasional failed or hung clone attempts (HBASE-7352), and an
occasional ZK-related slow abort. As
they are found and characterized, I've been filing them under HBASE-6055
(offline-snapshots) or HBASE-7290 (online-snapshots).
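To make the split/balance failure mode above concrete, here is a toy Python
sketch. It is not the actual HBase verification code; the function name,
the result type, and the region names are all made up for illustration. It
only assumes that verification compares the set of regions known when the
snapshot started against the set of regions that actually ended up in the
snapshot, and aborts on any mismatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifyResult:
    """Outcome of the (hypothetical) final verification step."""
    ok: bool
    missing: frozenset      # regions expected but not captured
    unexpected: frozenset   # regions captured but not expected


def verify_snapshot_regions(regions_at_start, regions_in_snapshot):
    """Compare the region set seen at snapshot start with what was captured.

    Any difference is treated as a failure, because we cannot tell
    whether the snapshot covers the whole table.
    """
    expected = frozenset(regions_at_start)
    captured = frozenset(regions_in_snapshot)
    missing = expected - captured
    unexpected = captured - expected
    return VerifyResult(
        ok=not missing and not unexpected,
        missing=missing,
        unexpected=unexpected,
    )


# Quiet cluster: the same regions before and after, so verification passes.
quiet = verify_snapshot_regions(["r1", "r2"], ["r1", "r2"])

# Region r2 splits mid-snapshot into two daughters: the parent is gone and
# two unknown regions appear, so verification has to abort.
split = verify_snapshot_regions(["r1", "r2"], ["r1", "r2a", "r2b"])
```

In this simplified model a split always shows up as one missing parent plus
unexpected daughters, which is roughly the "don't know if we have all
regions" situation described above.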
I'm going to switch from bug fixing mode back to patch polishing mode today
to get some of this committed to the snapshot dev branch. Here's how I
hope to deal with them moving forward.
I'll be polishing the pieces I've been testing (there are about 5-7 patches
in-flight currently) and putting updated pieces up for review. There is
non-trivial overhead maintaining this many patches "in the future". Since
this is a dev branch, I'm going to ask that reviews of these initial big
dev-branch patches focus on understandability, and that your +1's let
us punt to follow-on jiras and TODOs more frequently than if you were
reviewing for trunk. The sooner we get the skeleton in, the easier
collaboration will be with other folks working on and testing the same branch.
Ideally, getting the large pieces in would allow follow-ons to be easier
to review and tackle. The promise here, of course, is that many of these
follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be
blockers before merging offline snapshots to trunk and merging online
snapshots to trunk.
We initially had one snapshot branch (offline snapshots), but I'm
proposing having two: the offline-snapshot branch and the online-snapshot
branch. Jesse's been the master of the offline branch and has been pushing
dev-branch patches to that branch
(https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin
pushing dev-branch *reviewed commits* for online snapshots to another
branch. For those following, here's an explanation of how I'm working.
* The latest for-review patches will always be on review board.
* Branch-committed portions (patches reviewed and +1'ed for the branch) for
online snapshots will live here:
https://github.com/jmhsieh/hbase/tree/snapshots. My branch will
periodically be force-pushed to deal with rebases onto the constantly
updating trunk, and to include offline-branch committed patches.
* The latest working and consolidated online-snapshot branch (commits
correspond to HBASE jiras) will live at
https://github.com/jmhsieh/hbase/tree/snapshots-work . This branch is
subject to frequent forced pushes. It is a cleanup step done to prep
patches for review and to match what the eventual commit structure would
look like. It also contains some patches that may be abandoned or reordered.
* Rough incremental in-progress branches live here:
https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (replace 1213 with
the latest date to see where I am). These rough branches have many small
commits that focus on functionality and need to be rebased to "sprinkle"
edits into the appropriate JIRA-corresponding patches. These branches
will rarely, if ever, be force-pushed. These are what I test from,
and they are probably suitable for others to use for testing. I periodically
merge these with snapshots-work, mostly as proof that what I have up for
review is the same as what I've been testing.
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]