HDFS, mail # dev - corrupted edits log after power failure


Gabi Kazav 2011-09-22, 08:48
Kihwal Lee 2011-09-22, 17:37
Re: corrupted edits log after power failure
Brian Bockelman 2011-09-22, 19:15
Hi Gabi,

I'd be a bit scared of that backup strategy. What happens if the TCP connection gets cut suddenly during the curl? What happens if the data gets corrupted in transit? Such things have happened before.
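
A minimal sketch of one way to guard against a cut-off transfer, assuming the server reports the payload size (for example in a Content-Length header) and that the download is small enough to hold in memory. The function name and parameters are hypothetical. It only replaces the previous backup when the byte count matches, and it writes through a temporary file so a crash mid-write never leaves a half-written backup in place. A length check does not detect corrupted bytes; that would need a checksum from a trusted source.

```python
import os
import tempfile


def save_if_complete(data: bytes, expected_length: int, dest_path: str) -> bool:
    """Replace dest_path with data only if data has the expected length.

    Returns True if the file was replaced, False if the payload was rejected.
    """
    if len(data) != expected_length:
        # Truncated (or over-long) transfer: leave the previous backup untouched.
        return False

    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".partial-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in a single step on POSIX filesystems.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

With this shape, a failed or short download leaves yesterday's copy in place instead of overwriting it with a partial file.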

At our site, the SecondaryNameNode (SNN) merges the edits every 15 minutes. If a merge hasn't happened in 30 minutes, people get emailed. If it hasn't happened in 45 minutes, people get paged.
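
The escalation rule above can be sketched as a small function. This is an illustration only: the 30 and 45 minute thresholds come from the message, while the function name, the return values, and the idea of feeding it a file modification time are assumptions.

```python
EMAIL_AFTER_SECONDS = 30 * 60  # no merge for 30 minutes: send email
PAGE_AFTER_SECONDS = 45 * 60   # no merge for 45 minutes: page someone


def alert_level(last_merge_time: float, now: float) -> str:
    """Map the age of the last successful merge to an alert level.

    Both arguments are timestamps in seconds (e.g. from time.time()).
    """
    age = now - last_merge_time
    if age >= PAGE_AFTER_SECONDS:
        return "page"
    if age >= EMAIL_AFTER_SECONDS:
        return "email"
    return "ok"
```

One possible input for last_merge_time is os.path.getmtime() of the merged image file, assuming the file is rewritten on every successful merge.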

In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.
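
A rough sketch of what a versioned backup could look like, not a description of the actual setup: each run copies the file under a name with a sortable stamp appended and prunes all but the newest few copies. The function name, the stamp format, and the retention count are hypothetical.

```python
import os
import shutil


def versioned_copy(src_path: str, backup_dir: str, stamp: str, keep: int = 10) -> str:
    """Copy src_path to backup_dir as '<name>.<stamp>' and prune old versions.

    stamp must sort lexicographically in time order, e.g. '20110922T1915'.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    os.makedirs(backup_dir, exist_ok=True)

    base = os.path.basename(src_path)
    dest = os.path.join(backup_dir, f"{base}.{stamp}")
    shutil.copy2(src_path, dest)

    # Oldest first, because the stamps sort in time order.
    versions = sorted(n for n in os.listdir(backup_dir) if n.startswith(base + "."))
    for old in versions[:-keep]:
        os.remove(os.path.join(backup_dir, old))
    return dest
```

Keeping several generations means that a corrupt image copied on one run does not destroy the last good one.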

The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work.  This would ruin someone's day, but not someone's week.

The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

Brian

On Sep 22, 2011, at 3:48 AM, Gabi Kazav wrote:

> Hi,
>
> I had a power failure.
> I have a backup of these files: edits, fsimage.
>
> I am backing it up with:
>
> curl -s http://nameNode:50070/getimage?getimage=1 > fsimage
> curl -s http://nameNode:50070/getimage?getedits=1 > edits
>
> When I try to start HDFS with the recovered files, I get an error about the edits file: "Error replaying edit log at offset 1921"
>
> I also have an edits.new file. When I rename it to edits, I get: "ERROR org.apache.hadoop.hdfs.server.common.Storage: Error replaying edit log at offset 2494103"
>
> What is the problem?!
>
>
> And from now on, how can I do a backup that works?! :)
>
> Thanks,
> Gabi.
>
>
>
>
> Gabi Kazav
> IT Manager And Infrastructure Engineer
> [EMAIL PROTECTED] | www.pursway.com
> Mailing address PO Box 4125, Herzliya 46140
> Address 8 Hamada St., Herzliya, IL | Tel +972 527 772457| Fax + 972 9 958 4736
>
Steve Loughran 2011-09-26, 14:34