Snapshot status
Jonathan Hsieh 2013-01-22, 08:58
A couple of weeks ago I posted saying that we thought we were close to merging the snapshots branches with trunk. We were a bit optimistic. Here's an update:

In a non-fault-injection workload on a 5 node cluster, we've been able to keep the cluster under load while it takes a bunch of online snapshots and clones them. This is also robust when we kill a master while taking snapshots. Some of the other cases brought up by Aleks are not robust yet. We've found and fixed a bunch of exception handling problems, and non-trivial race conditions with the archiver and the restore/clone code.

We've also scaled up recently. At the moment, online snapshots are not robust on a 20 node cluster. Several attempts will succeed, but we eventually get to a state where snapshots can no longer be taken. This is likely because of a race condition we aren't handling correctly yet. Specifically, it seems that after we encounter a NotServingRegionException, we get to a state where all subsequent snapshot requests fail. We are hunting it down currently.

We've also found a race that causes failures on clone/restore due to the hfile archiver and compactions. Matteo has posted a few workaround approaches for this problem, and we are likely to take the simplest fix.

Hopefully by the end of this week or by early next week, Matteo and I will be able to get these bugs tackled and have robust, hardened online snapshotting and restore/clone ready for a merge attempt.

Jon.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]