There's also a ticket open to enable global snapshots for a single HDFS
instance: https://issues.apache.org/jira/browse/HADOOP-3637. While this
doesn't solve the multi-site backup issue, it does provide stronger
protection against programmatic deletion of data in a single cluster.
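For readers wondering what such snapshots might look like in practice: nothing is released yet, so the commands below are purely a hypothetical sketch, modeled on the style of existing `hdfs dfsadmin` / `hdfs dfs` subcommands. Names and paths are assumptions, not a shipped interface.

```shell
# Hypothetical sketch only -- snapshots are not in a released Hadoop
# as of this thread; subcommand names are assumed for illustration.

# Mark a directory as snapshottable (administrator action):
hdfs dfsadmin -allowSnapshot /user/data

# Take a point-in-time snapshot before running a risky batch job:
hdfs dfs -createSnapshot /user/data before-cleanup

# If a job deletes files by mistake, copy them back out of the
# read-only snapshot directory:
hdfs dfs -cp /user/data/.snapshot/before-cleanup/important.log /user/data/
```

The appeal is that a snapshot is a cheap, read-only view inside the same cluster, which guards against buggy or malicious deletes but, as noted above, is not a substitute for an off-cluster backup.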
On Mon, Feb 9, 2009 at 5:22 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
> On 2/9/09 4:41 PM, "Amandeep Khurana" <[EMAIL PROTECTED]> wrote:
> > Why would you want to have another backup beyond HDFS? HDFS itself
> > replicates your data, so the reliability of the system shouldn't be a
> > concern (if it is at all)...
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and pointing out that he was
> in violation of the contract) because he said RAID was "good enough".
> Then the RAID controller failed. When we couldn't recover data "from the
> other mirror", he was fired. Not sure how they ever recovered, especially
> considering what data it was they lost. Hopefully they had a paper trail.
> To answer Nathan's question:
> > On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <[EMAIL PROTECTED]> wrote:
> >> How do people back up their data that they keep on HDFS? We have many TB
> >> data which we need to get backed up but are unclear on how to do this
> >> efficiently/reliably.
> The content of our HDFSes is loaded from elsewhere and is not considered
> 'the source of authority'. It is the responsibility of the original source
> to maintain backups, and we then follow their policies for data retention.
> For user generated content, we provide *limited* (read: quota'ed) NFS space
> that is backed up regularly.
> Another strategy we take is multiple grids in multiple locations that get
> the data loaded simultaneously.
> The key here is to prioritize your data. Impossible-to-replicate data gets
> backed up using whatever means necessary; hard-to-regenerate data is the
> next priority. Data that's easy to regenerate, or OK to nuke, doesn't get
> backed up at all.
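For the multiple-grids strategy Allen describes, the stock tool for bulk copies between clusters is `hadoop distcp`. A minimal sketch follows; the NameNode hostnames, ports, and paths are placeholders, not values from this thread:

```shell
# Copy a high-priority dataset from the primary grid to a grid in
# another location (hostnames and paths are placeholders):
hadoop distcp \
    hdfs://nn-primary:8020/data/critical \
    hdfs://nn-remote:8020/backup/critical

# On subsequent runs, -update skips files that already match at the
# destination, making repeated syncs much cheaper:
hadoop distcp -update \
    hdfs://nn-primary:8020/data/critical \
    hdfs://nn-remote:8020/backup/critical
```

DistCp runs as a MapReduce job, so it parallelizes the copy across the cluster; scheduling it regularly (e.g. from cron) per dataset is one way to implement the "back up by priority" policy above.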