-
Hadoop HDFS Backup/Restore Solutions
Mac Noland 2012-01-03, 20:53
Good day,

I’m guessing this question has been asked a myriad of times, but we’re about to get serious with some of our Hadoop implementations, so I wanted to re-ask to see if I’m missing anything, or if others happen to know whether this might be on a future road map.

For our current storage offerings (e.g. NAS or SAN), we give businesses the opportunity to choose 7, 14, or 45 day “backups” for their storage. The purpose of the backup isn’t so much that they are worried about losing their current data (we’re RAID’ed and have some stuff mirrored to remote datacenters), but more that if they were to delete some data today, they can recover it from yesterday’s backup. Or the day before’s backup, or the day before that, etc. And to be honest, business units buy a good portion of their backups to make people feel better and fulfill custom contracts.

So far with HDFS we haven’t found too many formalized offerings for this specific feature. While I haven’t done a ton of research, the best solution I’ve found is an idea where we’d schedule a job to pull the data locally to a mount that is backed up via our traditional methods. See Michael Segel’s first post on this site: http://lucene.472066.n3.nabble.com/Backing-up-HDFS-td1019184.html

Though we’d have to work through the details of what this would look like for our support folks, it looks like something that could potentially fit into our current model. We’d basically need to allocate the same amount of SAN or NAS disk as we have for HDFS, then coordinate a snap on the SAN or NAS via our traditional methods. Not sure what a restore would look like, other than we could give the end users read access to the NAS or SAN mounts so they can pick through what they need to recover and let them figure out how to get it back into HDFS.

For use cases like ours where we’d need multi-day backups to fulfill business needs, is this kind of what people are thinking or doing? Moreover, is there anything on the Hadoop HDFS road map for providing, for lack of a better word, an “enterprise” backup/restore solution?

Thanks in advance,
Mac Noland – Thomson Reuters
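The 7, 14, or 45 day retention tiers mentioned above come down to a simple pruning rule once dated backup copies exist. The sketch below is only an illustration of that rule, not anything HDFS provides: the function name `expired_backups` and the idea of one dated copy per night are assumptions made for the example.

```python
from datetime import date, timedelta


def expired_backups(backup_dates, retention_days, today):
    """Return the backup dates that are older than the retention window.

    A backup taken exactly `retention_days` days ago is still kept.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)


# Hypothetical example: ten nightly backups checked against a 7-day tier.
today = date(2012, 1, 3)
nightly = [today - timedelta(days=n) for n in range(10)]
to_delete = expired_backups(nightly, retention_days=7, today=today)
```

With a 14 or 45 day tier only `retention_days` changes; whatever stores the copies (SAN/NAS snapshots, dated directories) is assumed to do the actual deletion.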
-
Re: Hadoop HDFS Backup/Restore Solutions
alo alt 2012-01-03, 21:10
Hi Mac,

HDFS at the moment has no solution for a complete backup and restore process like ITL or ISO9000. One strategy could be to "park" the data from HDFS that you want to back up on tape: copy it with "distcp" to another backup cluster, then snapshot that cluster with the SAN mechanism. For this, the DN store has to be located on the SAN box.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

Think of the environment: please don't print this email unless you really need to.
-
Re: Hadoop HDFS Backup/Restore Solutions
Mac Noland 2012-01-03, 21:31
Thanks for the reply Alex. To make sure I understand:

1) "Park" the data by sending it over to a different cluster on a schedule (e.g. nightly is what we offer today on most things).
2) Then from this secondary cluster, which is sitting idle after the distcp, do a copy to local onto an NFS mount pointed at SAN or NAS.
3) Then, with some type of coordination (so you're not copying to local when the backup happens), have the SAN or NAS device snap the data for backup.

A simple restore process would then be to allow users read access to the NFS-mounted storage so they can pick and choose what they want to recover via the SAN or NAS's snapshot feature, or after a "restore" to the local file system is completed by the support folks if they are using one of our older systems.

Is that about right?

Mac
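The first two steps of the workflow described above can be sketched as a pair of command lines built per night. This is a minimal sketch under stated assumptions: the cluster URIs, the NFS mount path, and the function name `nightly_backup_commands` are all hypothetical, and the SAN/NAS snapshot in step 3 is vendor-specific, so it is left as a comment.

```python
from datetime import date


def nightly_backup_commands(src_uri, backup_uri, nfs_root, day):
    """Build the commands for steps 1 and 2 of the nightly backup.

    Returns a list of argument lists suitable for subprocess.run().
    """
    # Each night gets its own dated directory on the NFS mount.
    local_dir = f"{nfs_root}/{day.isoformat()}"
    return [
        # Step 1: copy changed data to the secondary (backup) cluster.
        ["hadoop", "distcp", "-update", src_uri, backup_uri],
        # Step 2: copy from the backup cluster onto the NFS mount.
        ["hadoop", "fs", "-copyToLocal", backup_uri, local_dir],
        # Step 3 (not shown): trigger the SAN/NAS snapshot only after
        # step 2 has finished, using the storage vendor's own tooling.
    ]


# Hypothetical URIs and mount point, for illustration only.
commands = nightly_backup_commands(
    "hdfs://prod-nn/data",
    "hdfs://backup-nn/data",
    "/mnt/nas/hdfs-backup",
    date(2012, 1, 3),
)
```

Running the commands in order and stopping at the first failure gives the coordination asked about in step 3: the snapshot is only requested once the copy to local has completed.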
-
Re: Hadoop HDFS Backup/Restore Solutions
Joe Stein 2012-01-03, 21:34
You can also distcp to AWS S3 (http://wiki.apache.org/hadoop/AmazonS3), which you can do as frequently as you like. Even right after the map/reduce job is done, just ship it over.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop <http://twitter.com/#!/allthingshadoop>
*/
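As a rough illustration of the distcp-to-S3 suggestion, the sketch below builds the command line for shipping a job's output directory to a bucket. The bucket name, paths, and the helper `distcp_to_s3_command` are hypothetical. The `s3n` scheme is assumed here because it is the native S3 filesystem of Hadoop releases from that period; newer releases use `s3a` instead, and AWS credentials are assumed to be set in the Hadoop configuration already.

```python
def distcp_to_s3_command(src_uri, bucket, prefix, scheme="s3n"):
    """Build a `hadoop distcp` command that copies src_uri into an S3 bucket."""
    # Normalise the prefix so the destination has no doubled or trailing slashes.
    dest = f"{scheme}://{bucket}/{prefix.strip('/')}"
    return ["hadoop", "distcp", src_uri, dest]


# Hypothetical job output path and bucket, for illustration only.
command = distcp_to_s3_command("hdfs://prod-nn/jobs/out", "my-backups", "/jobs/out/")
```

Calling this at the end of a job driver, then passing the result to `subprocess.run`, is one way to "just ship it over" as soon as the map/reduce job finishes.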
-
Re: Hadoop HDFS Backup/Restore Solutions
Alexander Lorenz 2012-01-03, 21:42
Yes. That's what I've done to fit ITL. You can also export the data dir you backed up over Samba / NFS so people have an easier way to restore their files (fuse hdfs). For SMB I wrote an article on my blog.

The copy to another cluster has the charm of a fast restore of lost files as the first step of your backup concept.

- Alex
-
Re: Hadoop HDFS Backup/Restore Solutions
Ted Dunning 2012-01-03, 22:07
MapR provides this out of the box in a completely Hadoop-compatible environment. Doing this with straight Hadoop involves a fair bit of baling wire.
-
Re: Hadoop HDFS Backup/Restore Solutions
Arun C Murthy 2012-01-03, 22:15
On Jan 3, 2012, at 2:07 PM, Ted Dunning wrote:
> MapR provides this out of the box in a completely Hadoop compatible environment.

Does it support *secure* Hadoop clusters?

Arun
-
Re: Hadoop HDFS Backup/Restore Solutions
Ossi 2012-01-05, 14:34
hi,

I was just going to ask this on the Hadoop list, but luckily I checked this one first. I've also been searching the net for HDFS backup solutions, but there isn't much information available. So I'd dare to say that it hasn't been asked a myriad of times. ;)

I found this question (which is basically the same question I'm asking now), http://www.quora.com/Whats-the-right-way-to-backup-Hadoop, with 4 suggested solutions:

1. hdfs + fuse -> IIRC there might be some scaling problems, and you still need to copy that data somewhere
2. flume (or similar) -> at least flume isn't reliable enough, based on our testing and using it to collect some logs into Hadoop
3. high degree of replication in hdfs -> isn't actually a backup
4. back up hdfs every hour/day/other interval to a locally mounted fs: http://blog.rapleaf.com/dev/2009/06/05/backing-up-hadoops-hdfs/

In addition to the above:

5. apparently some are using distcp, but some (others or the same people) claim that it is unreliable. It was mentioned here as well.
6. then there is also Mozilla's alternative to distcp: http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/

So for you (and maybe for us as well), approach 4 might be the most feasible option. I quickly tested the Backup.java from 4 and it seemed to get the data. However, I haven't done any decent tests yet, and I have no clue how reliable or high-performing it is. It may at least have scaling problems. Might still be worth checking out.

Regards,
Ossi
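Given the doubts raised above about how reliable the copy tools are, one inexpensive safeguard is to compare a listing of the source against a listing of the backup after each run. The sketch below assumes both listings have already been reduced to a mapping of path to file size (for example from a recursive `hadoop fs` listing); that format and the name `compare_manifests` are assumptions for illustration, and a size match does not prove the contents are identical.

```python
def compare_manifests(source, backup):
    """Compare two {path: size_in_bytes} manifests.

    Returns (missing, mismatched): paths absent from the backup, and
    paths present in both whose sizes differ.
    """
    missing = sorted(p for p in source if p not in backup)
    mismatched = sorted(
        p for p in source if p in backup and source[p] != backup[p]
    )
    return missing, mismatched


# Hypothetical manifests, for illustration only.
source = {"/logs/a.log": 100, "/logs/b.log": 250, "/logs/c.log": 75}
backup = {"/logs/a.log": 100, "/logs/b.log": 200}
missing, mismatched = compare_manifests(source, backup)
```

A backup run would be treated as successful only when both lists come back empty; anything else is a cue to re-copy the affected paths before the SAN/NAS snapshot is taken.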