HDFS, mail # dev - corrupted edits log after power failure


Re: corrupted edits log after power failure
Steve Loughran 2011-09-26, 14:34
On 22/09/11 20:15, Brian Bockelman wrote:
> Hi Gabi,
>
> I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl?  What happens if there's a TCP corruption?  Such things have happened before.

Curl might work for long-haul backups, but I'd use HTTPS for its better
checksums, and have alternate in-cluster strategies, such as shared HA
filesystems.
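
Whatever the transport, an end-to-end check after the copy catches both
a cut connection and silent corruption. A minimal sketch in Python,
assuming a SHA-256 digest of the image has been computed on the source
side and shipped alongside it; the function names and the choice of
SHA-256 are illustrative assumptions, not part of HDFS or curl:

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large image is never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_hex):
    """True if the local copy matches the digest taken at the source.

    expected_hex is assumed to have been computed on the source host
    (hypothetical workflow); a truncated or corrupted copy will not match.
    """
    return file_sha256(path) == expected_hex.strip().lower()
```

A backup that fails this check should be discarded and fetched again
rather than kept as the latest good copy.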

>
> Personally, we have the SNN merge the edits every 15 minutes.  If it hasn't happened in 30 minutes, people get emailed.  If it doesn't happen in 45 minutes, people get paged.

That's a good technique for verifying the SNN is actually working.
Thinking it is working when it isn't is dangerous.
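
The escalation Brian describes can be sketched as a check on how long
ago the last merge happened. This is a minimal Python sketch using the
30 and 45 minute thresholds from his mail; the function names are
hypothetical, and using a checkpoint file's modification time as the
"last merge" signal is an assumption about how you would measure it:

```python
import os
import time
from datetime import timedelta

# Thresholds from the thread: merges run every 15 minutes,
# email after 30 minutes without one, page after 45.
EMAIL_AFTER = timedelta(minutes=30)
PAGE_AFTER = timedelta(minutes=45)


def checkpoint_age(path, now=None):
    """Time since the file at `path` was last modified.

    `path` is whatever file your SNN rewrites on each merge
    (an assumption; pick the one that applies to your setup).
    """
    now = time.time() if now is None else now
    return timedelta(seconds=max(0.0, now - os.path.getmtime(path)))


def checkpoint_alert_level(age):
    """Map the age of the last merge to 'ok', 'email', or 'page'."""
    if age >= PAGE_AFTER:
        return "page"
    if age >= EMAIL_AFTER:
        return "email"
    return "ok"
```

Run from cron or a monitoring system, the returned level decides whether
to stay quiet, send mail, or page someone.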

> In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
>
> The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work.  This would ruin someone's day, but not someone's week.
>
> The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

That is: test your handling of the outage on a regular basis.