Re: corrupted edits log after power failure
Steve Loughran 2011-09-26, 14:34
On 22/09/11 20:15, Brian Bockelman wrote:
> Hi Gabi,
> I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl? What happens if there's a TCP corruption? Such things have happened before.
Curl might work for long-haul backups, but I'd use HTTPS for its better
checksums, and have alternate in-cluster strategies, such as shared HA
> Personally, we have the SNN merge the edits every 15 minutes. If it hasn't happened in 30 minutes, people get emailed. If it doesn't happen in 45 minutes, people get paged.
That's a good technique for verifying the SNN is actually working.
Thinking it is working when it isn't is dangerous.
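
As a rough illustration of the escalation Brian describes (email at 30 minutes without a merge, page at 45), here is a minimal Python sketch. It assumes the age of the last checkpoint can be read from the modification time of the checkpointed image file; the function names and the idea of polling a file's mtime are my own assumptions for illustration, not a description of Brian's actual setup.

```python
import os
import time

# Thresholds taken from the schedule quoted above.
EMAIL_AFTER_S = 30 * 60   # no merge for 30 minutes: email people
PAGE_AFTER_S = 45 * 60    # no merge for 45 minutes: page people


def checkpoint_age_seconds(path, now=None):
    """Seconds since the file at `path` was last modified.

    `path` is a hypothetical location of the merged image; where the
    checkpoint actually lands depends on your configuration.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def checkpoint_alert_level(age_seconds):
    """Map checkpoint age to an action: 'ok', 'email' or 'page'."""
    if age_seconds >= PAGE_AFTER_S:
        return "page"
    if age_seconds >= EMAIL_AFTER_S:
        return "email"
    return "ok"
```

Run from cron every few minutes, something like `checkpoint_alert_level(checkpoint_age_seconds(image_path))` would feed whatever mail or pager hook you already have. It checks the result of the merge rather than whether the SNN process is alive, so a live daemon that has silently stopped merging still trips the alert.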
> In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
> The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work. This would ruin someone's day, but not someone's week.
> The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).
That is: test your handling of the outage on a regular basis.