Re: Backing up HDFS?
We copy selected files from HDFS to KFS and use a KFS instance as the backup file system.
We use distcp to take the backups.
Lohit
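
A minimal sketch of how such a distcp backup might be scripted, assuming the `hadoop` command is on the PATH and a KFS file system is configured for the cluster. The host names, ports, and paths are hypothetical placeholders, and `build_distcp_command` is an illustrative helper, not part of Hadoop. The function only builds the command line; running it is left to the caller.

```python
def build_distcp_command(src_uri, dst_uri, update=True):
    """Return the argument list for a `hadoop distcp` copy from src_uri to dst_uri."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing from, or differ at, the destination
        cmd.append("-update")
    cmd += [src_uri, dst_uri]
    return cmd


if __name__ == "__main__":
    # Hypothetical source and destination; substitute your own cluster addresses.
    cmd = build_distcp_command(
        "hdfs://namenode.example.com:8020/data/selected",
        "kfs://kfs-meta.example.com:20000/backup/selected",
    )
    print(" ".join(cmd))
    # To actually run the copy: import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it makes it easy to log or dry-run the command before a large copy starts.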

----- Original Message ----
From: Allen Wittenauer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, February 9, 2009 5:22:38 PM
Subject: Re: Backing up HDFS?

On 2/9/09 4:41 PM, "Amandeep Khurana" <[EMAIL PROTECTED]> wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if it is one at all)...

I'm reminded of a previous job where a site administrator refused to make
tape backups (despite our continual harassment and pointing out that he was
in violation of the contract) because he said RAID was "good enough".

Then the RAID controller failed. When we couldn't recover data "from the
other mirror", he was fired.  Not sure how they ever recovered, especially
considering what data they lost.  Hopefully they had a paper trail.

To answer Nathan's question:

> On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <[EMAIL PROTECTED]> wrote:
>
>> How do people back up their data that they keep on HDFS? We have many TB of
>> data which we need to get backed up but are unclear on how to do this
>> efficiently/reliably.

The content of our HDFSes is loaded from elsewhere and is not considered
'the source of authority'.  It is the responsibility of the original sources
to maintain backups and we then follow their policies for data retention.
For user generated content, we provide *limited* (read: quota'ed) NFS space
that is backed up regularly.

Another strategy we use is to run multiple grids in multiple locations and
load the data into all of them simultaneously.

The key here is to prioritize your data.  Impossible-to-replicate data gets
backed up using whatever means necessary; hard-to-regenerate data is the next
priority.  Easy-to-regenerate, OK-to-nuke data doesn't get backed up.
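
The prioritization above could be written down as a small policy table. This is only a sketch under stated assumptions: the tier names, the dataset paths, and the `backup_order` helper are all hypothetical, chosen to illustrate the three classes described in the message.

```python
from enum import Enum


class Regenerability(Enum):
    IMPOSSIBLE = 1  # cannot be recreated: back up by whatever means necessary
    HARD = 2        # expensive to regenerate: next priority
    EASY = 3        # cheap to regenerate, OK to lose: not backed up


def backup_order(datasets):
    """Return dataset names to back up, highest priority first.

    `datasets` maps a dataset name to its Regenerability tier.
    EASY datasets are skipped entirely.
    """
    to_backup = [
        (tier.value, name)
        for name, tier in datasets.items()
        if tier is not Regenerability.EASY
    ]
    return [name for _, name in sorted(to_backup)]


if __name__ == "__main__":
    # Hypothetical datasets, for illustration only.
    datasets = {
        "/scratch/tmp": Regenerability.EASY,
        "/derived/aggregates": Regenerability.HARD,
        "/user/uploads": Regenerability.IMPOSSIBLE,
    }
    print(backup_order(datasets))
```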