-
How to remove three disks from three different nodes in a ten node cluster in less than an hour without losing replicas?
Stack 2013-01-31, 06:34
Here is a little puzzle.
An admin works for a cash-strapped, popular web shop. At the datacenter she has a heavily used ten-node cluster. It runs hot all day long, and decommissioning a node, with its background re-replication of 12 disks' worth of data, disrupts the workload she has on top of it and makes her clients very unhappy. Replicating the data of one node takes at least an hour. This cluster has three bad disks in three different nodes (replication factor is 3). The admin lives an hour from the datacenter. She can't afford a cage monkey and so must replace the disks herself.
If she leaves home at 2pm and has to be back by 6pm, before the kids come home from school, how would she replace the three disks without losing a replica?
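The time budget in the puzzle can be worked out in a few lines. This is only a sketch: the departure time, return time, travel time, and the one-hour-per-node figure come from the post, while the per-disk estimate (one twelfth of a node, assuming 12 equally full disks and linear scaling) is an assumption and not something the post states.

```python
# Time-budget sketch for the puzzle. All figures come from the post except
# the per-disk estimate, which is an assumption.

def onsite_minutes(leave_hour, return_hour, one_way_travel_hours):
    """Minutes available at the datacenter after subtracting the round trip."""
    return int((return_hour - leave_hour - 2 * one_way_travel_hours) * 60)

def fits(budget_minutes, waits, minutes_per_wait):
    """True if `waits` sequential re-replication waits fit in the budget."""
    return waits * minutes_per_wait <= budget_minutes

budget = onsite_minutes(14, 18, 1)   # 2pm to 6pm, one hour each way
node_wait = 60                       # "at least an hour" to replicate one node
disk_wait = node_wait / 12           # ASSUMPTION: one of 12 equally full disks

print(budget)                        # minutes on site
print(fits(budget, 3, node_wait))    # three full node decommissions, one after another
print(fits(budget, 3, disk_wait))    # three single-disk waits under the assumption above
```

Under these numbers three sequential node decommissions do not fit in the on-site window, which is why a per-disk approach matters.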
Is the only answer to remove one, wait for a clean fsck run, then remove the next one?
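The "wait for a clean fsck run" step could be checked mechanically along these lines. This is a rough sketch: it assumes the text comes from something like `hadoop fsck /`, and that the report contains an "Under-replicated blocks:" line and an "is HEALTHY" status line. The exact format may differ between versions, so the patterns should be checked against real output first. The function names are hypothetical.

```python
import re

def under_replicated_count(fsck_output):
    """Return the under-replicated block count, or None if the line is absent.

    ASSUMPTION: the report has a line like 'Under-replicated blocks: <n> ...'.
    """
    match = re.search(r"Under-replicated blocks:\s*(\d+)", fsck_output)
    return int(match.group(1)) if match else None

def safe_to_pull_next_disk(fsck_output):
    """True only if the report says HEALTHY and shows zero under-replicated blocks.

    A filesystem can be reported HEALTHY while blocks are still
    under-replicated, so both conditions are checked.
    """
    healthy = "is HEALTHY" in fsck_output
    return healthy and under_replicated_count(fsck_output) == 0
```

If the count line cannot be found, the check returns False rather than guessing, so an unexpected report format errs on the side of waiting.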
Thanks, St.Ack
-
Re: How to remove three disks from three different nodes in a ten node cluster in less than an hour without losing replicas?
Colin McCabe 2013-02-04, 22:53
It sounds like what you would like is a way to decommission just one storage directory on the DataNode. We don't currently support that.
You might be able to get something approaching this result with "chmod 000 $storage_directory_root". That would at least prevent new blocks from being created on the disk which you don't trust any more. It would also cause the existing blocks to be re-replicated when the DirectoryScanner re-ran and noticed it couldn't get to them. Note that I haven't actually tested the chmod solution, though, so your mileage may vary.
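For anyone who wants to script the chmod idea, a minimal and equally untested sketch follows. The helper names are hypothetical, and the path passed in stands for whatever `$storage_directory_root` is on the node. The previous mode is returned so it can be restored once the disk has been swapped.

```python
import os
import stat

def fence_off_storage_dir(storage_directory_root):
    """Strip all permission bits from the directory (same effect as chmod 000).

    Returns the previous permission bits so they can be restored later.
    """
    previous_mode = stat.S_IMODE(os.stat(storage_directory_root).st_mode)
    os.chmod(storage_directory_root, 0o000)
    return previous_mode

def restore_storage_dir(storage_directory_root, previous_mode):
    """Put back the permission bits saved by fence_off_storage_dir."""
    os.chmod(storage_directory_root, previous_mode)
```

Note that permission bits do not stop a process running as root, so this only has a chance of working if the DataNode runs as an unprivileged user.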
best, Colin