Re: corrupted edits log after power failure
Steve Loughran 2011-09-26, 14:34
On 22/09/11 20:15, Brian Bockelman wrote:
> Hi Gabi,
>
> I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl? What happens if there's a TCP corruption? Such things have happened before.

Curl might work for long-haul backups, but I'd use HTTPS for its better checksums, and have alternate in-cluster strategies, such as shared HA filesystems.

> Personally, we have the SNN merge the edits every 15 minutes. If it hasn't happened in 30 minutes, people get emailed. If it doesn't happen in 45 minutes, people get paged.

That's a good technique for verifying the SNN is actually working. Thinking it is working when it isn't is dangerous.

> In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
>
> The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work. This would ruin someone's day, but not someone's week.
>
> The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

That is: test your handling of the outage on a regular basis.
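
For anyone wanting to copy the alerting rule quoted above (merge every 15 minutes, email at 30, page at 45), here is a rough sketch of the check. It is not what Brian's site actually runs. It assumes that the modification time of the checkpoint file is a fair proxy for "when did the SNN last merge", and the function names, the constants and the idea of passing in a file path are all made up for illustration.

```python
import os
import time

# Thresholds taken from the schedule described in the thread.
WARN_AFTER = 30 * 60   # seconds without a merge before people get emailed
PAGE_AFTER = 45 * 60   # seconds without a merge before people get paged


def checkpoint_age(path, now=None):
    """Return seconds since the file at `path` was last modified.

    `path` is a hypothetical location of the latest checkpoint image;
    where that lives depends on your own configuration.
    """
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def classify_checkpoint_age(age_seconds,
                            warn_after=WARN_AFTER,
                            page_after=PAGE_AFTER):
    """Map a checkpoint age to an action: "ok", "email" or "page"."""
    if age_seconds >= page_after:
        return "page"
    if age_seconds >= warn_after:
        return "email"
    return "ok"
```

Run from cron every few minutes, something like `classify_checkpoint_age(checkpoint_age(path))` gives you the action to take; how the email or page is actually sent is left out on purpose. The monitor should run on a different host from the SNN, otherwise it dies along with the thing it is watching.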