Re: Backing up HDFS?
Hey,

There's also a ticket open to enable global snapshots for a single HDFS
instance: https://issues.apache.org/jira/browse/HADOOP-3637. While this
wouldn't solve the multi-site backup issue, it would provide stronger
protection against programmatic deletion of data in a single cluster.

Regards,
Jeff
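
For reference, the snapshot work tracked in that ticket eventually shipped,
long after this thread, as HDFS-2802 in Hadoop 2.1. A minimal sketch of the
resulting FileSystem API, using a hypothetical directory and snapshot name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotBackup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical directory; an admin must first mark it
        // snapshottable: hdfs dfsadmin -allowSnapshot /data/events
        Path dir = new Path("/data/events");

        // Take a read-only, point-in-time snapshot. Files later deleted
        // from the live tree stay readable under
        // /data/events/.snapshot/nightly until the snapshot is removed.
        fs.createSnapshot(dir, "nightly");
      }
    }

As Jeff notes, snapshots live on the same cluster, so they guard against
accidental or programmatic deletion but are not a substitute for an
off-cluster copy.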

On Mon, Feb 9, 2009 at 5:22 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:

> On 2/9/09 4:41 PM, "Amandeep Khurana" <[EMAIL PROTECTED]> wrote:
> > Why would you want to have another backup beyond HDFS? HDFS itself
> > replicates your data, so the reliability of the system shouldn't be a
> > concern (if at all it is)...
>
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and pointing out that he was
> in violation of the contract) because he said RAID was "good enough".
>
> Then the RAID controller failed. When we couldn't recover data "from the
> other mirror", he was fired. Not sure how they ever recovered, especially
> considering what data it was they lost. Hopefully they had a paper trail.
>
> To answer Nathan's question:
>
> > On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <[EMAIL PROTECTED]> wrote:
> >
> >> How do people back up their data that they keep on HDFS? We have many
> >> TB of data which we need to get backed up but are unclear on how to do
> >> this efficiently/reliably.
>
> The content of our HDFSes is loaded from elsewhere and is not considered
> 'the source of authority'. It is the responsibility of the original
> sources to maintain backups, and we then follow their policies for data
> retention. For user-generated content, we provide *limited* (read:
> quota'ed) NFS space that is backed up regularly.
>
> Another strategy we use is running multiple grids in multiple locations
> that get the data loaded simultaneously.
>
> The key here is to prioritize your data. Impossible-to-replicate data gets
> backed up by whatever means necessary; hard-to-regenerate data is the next
> priority; data that's easy to regenerate and OK to nuke doesn't get backed
> up.
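
For reference, a minimal sketch of the dual-grid loading Allen describes
above: the same bytes are written to two independent clusters so that
neither grid is ever the sole copy. Both cluster URIs and the target path
are hypothetical, and a real pipeline would add retries and checksum
verification; after-the-fact synchronization between grids is more commonly
handled with the stock DistCp tool (hadoop distcp).

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DualGridLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two independent grids; both URIs are hypothetical.
        FileSystem east = FileSystem.get(URI.create("hdfs://nn-east:8020"), conf);
        FileSystem west = FileSystem.get(URI.create("hdfs://nn-west:8020"), conf);

        Path target = new Path("/incoming/events/part-00000");  // hypothetical
        byte[] payload = "example record\n".getBytes(StandardCharsets.UTF_8);

        // Write the same bytes to both grids so a single-cluster failure
        // never takes out the only copy of freshly loaded data.
        for (FileSystem fs : new FileSystem[] { east, west }) {
          try (FSDataOutputStream out = fs.create(target)) {
            out.write(payload);
          }
        }
      }
    }

Writing synchronously to both grids trades some load latency for the
guarantee that losing one cluster never loses freshly ingested data.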