You could set fs.trash.interval to the number of minutes after which you consider rm'd data lost forever. Deleted data is moved into .Trash and only removed permanently after the configured time.
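For reference, a minimal core-site.xml sketch (assuming a 24-hour retention; adjust the value to your needs):

    <!-- core-site.xml: keep deleted files in .Trash for 1440 minutes (24h);
         a value of 0 disables the trash feature entirely -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>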
A second way would be to use FUSE to mount HDFS and back up your data over that mount into another storage tier. That is not the best solution, but a usable one.
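A rough sketch of that approach (the exact mount command and wrapper name depend on your distribution and build; the host, port, and paths below are placeholders):

    # Mount HDFS via FUSE; some distributions ship this as hadoop-fuse-dfs,
    # plain Apache builds use fuse_dfs / fuse_dfs_wrapper.sh instead.
    mkdir -p /mnt/hdfs
    hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs
    # Then copy what you need into another storage tier, e.g. with rsync:
    rsync -a /mnt/hdfs/user/important-data/ /backup/hdfs/important-data/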
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
On May 30, 2012, at 8:31 AM, Darrell Taylor wrote:
> Will "hadoop fs -rm -rf" move everything to the the /trash directory or
> will it delete that as well?
> I was thinking along the lines of what you suggest: keep the original
> source of the data somewhere and then reprocess it all in the event of a
> problem.
> What do other people do? Do you run another cluster? Do you back up
> specific parts of the cluster? Some form of offsite SAN?
> On Tue, May 29, 2012 at 6:02 PM, Robert Evans <[EMAIL PROTECTED]> wrote:
>> Yes you will have redundancy, so no single point of hardware failure can
>> wipe out your data, short of a major catastrophe. But you can still have
>> an errant or malicious "hadoop fs -rm -rf" shut you down. If you still
>> have the original source of your data somewhere else you may be able to
>> recover, by reprocessing the data, but if this cluster is your single
>> repository for all your data you may have a problem.
>> --Bobby Evans
>> On 5/29/12 11:40 AM, "Michael Segel" <[EMAIL PROTECTED]> wrote:
>> That's not a back up strategy.
>> You could still have joe luser take out a key file or directory. What do
>> you do then?
>> On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
>>> We are about to build a 10-machine cluster with 40TB of storage. As this
>>> gets full, actually trying to create an offsite backup becomes a problem
>>> unless we build another 10-machine cluster (too expensive right now).
>>> Not sure if it will help, but we have planned the cabinet into an upper
>>> and lower half with separate redundant power; we then plan to put half
>>> of the cluster in the top and half in the bottom, effectively 2 racks.
>>> In theory we could lose half the cluster and still have copies of all
>>> the blocks with a replication factor of 3? Apart from the data centre
>>> burning down or some other disaster that would render the machines
>>> unrecoverable, is this approach good enough?
>>> I realise this is a very open question and everyone's circumstances are
>>> different, but I'm wondering what other people's experiences/opinions are
>>> on backing up cluster data?