A couple of weeks ago I posted that we thought we were close to
merging the snapshots branches with trunk. We were a bit optimistic.
Here's an update:

In a non-fault-injection workload on a 5 node cluster, we've been able
to keep the cluster under load while it takes a bunch of online
snapshots and clones them. This is also robust when we kill a master
while taking snapshots. Some of the other cases brought up by
Aleks are not robust yet.

We've found and fixed a bunch of exception handling problems, and
non-trivial race conditions with the archiver and the restore/clone

We've also scaled up recently. At the moment, online snapshots are
not robust on a 20 node cluster. Several attempts succeed, but we
eventually reach a state where no further snapshots can be taken.
This is likely because of a race condition we aren't handling
correctly yet. Specifically, it seems that once we encounter a
NotServingRegionException, all subsequent snapshot requests fail.
We are hunting it down.

We've also found a race that causes failures on clone/restore due to
the hfile archiver and compactions. Matteo has posted a few
workaround approaches for this problem, and we are likely to take the

Hopefully by the end of this week or early next week, Matteo and I
will have these bugs tackled and have robust, hardened online
snapshotting and restore/clone ready for a merge attempt.

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]