David Charle 2012-06-22, 03:17
What is the best practice for removing a node and adding the same node back in HBase/Hadoop?
In our 10-node cluster, 2 nodes went down (bad disk; each node is down because the failed disk holds both the root volume and data). We need to replace the disks and add the nodes back. Any quick suggestions, or pointers to docs on the right procedure?
-- David
Michael Segel 2012-06-22, 03:35
Assuming that you have an Apache-based release (Apache, Hortonworks, Cloudera) ...
(If MapR, replace the drive and you should be able to repair the cluster from the console. The node doesn't go down.)
Node goes down.
10 min later, the cluster sees the node as down. It should then be able to re-replicate the missing blocks.

Replace the disk with a new disk and rebuild the file system.
Bring the node up.
Rebalance the cluster.
That should be pretty much it.
David Charle 2012-06-22, 13:40
Thanks Michael.
Does that mean I need to exclude the node first and then add it, or can I simply bring the node back and let Hadoop/HBase rebuild the missing blocks? (The data is spread over multiple drives and one drive is dead, so about 1/4 of the data, ~300 GB.) Or do we need to run a special fs check to ensure the missing data is re-replicated first?
Tom Brown 2012-06-22, 13:41
Can it notice the node is down sooner? If that node is serving an active region (or if it's a datanode for an active region), that would be a potentially large amount of downtime. With commodity hardware, and a large enough cluster, there will always be a machine or two being rebuilt...
Thanks!
-Tom
David Charle 2012-06-22, 13:58
Hi Tom
On Fri, Jun 22, 2012 at 6:41 AM, Tom Brown <[EMAIL PROTECTED]> wrote:
> Can it notice the node is down sooner? If that node is serving an active region (or if it's a datanode for an active region), that would be a potentially large amount of downtime. With commodity hardware, and a large enough cluster, there will always be a machine or two being rebuilt...

I still see 4 live servers and 0 dead servers out of 5, even though the processes on the other node have been down for more than 24 hrs.
Dave Barr 2012-06-22, 18:20
Don't bother excluding the host first if the host is already considered dead in the namenode.

Which version of Hadoop? In Hadoop versions nearer to 1.0.0 you will see "Number of Under-Replicated Blocks" in the namenode web UI while it's re-replicating data after a node goes down (it will go away when it's done). If not, you can see it in hadoop fsck / .

If a datanode is down long enough that the data has been re-replicated, then when you bring the DN back online the blocks it still holds will actually go into an over-replicated state (you'll see it in hadoop fsck /). I'm not sure exactly what handles this in the NN, but the NN will clean up over-replicated blocks over time.

As for rebalancing volume data within a datanode, it will eventually work itself out over time, as the DN allocates new blocks to the least full volume. The FAQ lists a hack to move data manually, but I personally have never tried it:

http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F

--Dave
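The fsck check described above can be scripted. What follows is only a minimal sketch: it assumes the summary printed by hadoop fsck / contains lines such as "Under-replicated blocks:" and "Over-replicated blocks:" followed by a count, and the sample text and function names are made up for illustration, not taken from a real cluster.

```python
import re

# Illustrative sample only: shaped like the summary that `hadoop fsck /`
# prints, but the numbers are invented for this sketch.
SAMPLE_FSCK_SUMMARY = """\
 Total blocks (validated):      1200 (avg. block size 67108864 B)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       37 (3.0833333 %)
 Corrupt blocks:                0
"""

# Matches the three summary lines this sketch cares about and captures
# the label and the leading integer count.
_COUNT_LINE = re.compile(
    r"^[ \t]*(Over-replicated|Under-replicated|Corrupt) blocks:\s+(\d+)",
    re.MULTILINE,
)


def replication_counts(fsck_text):
    """Return a dict of the block counts found in the fsck summary text."""
    return {label: int(value) for label, value in _COUNT_LINE.findall(fsck_text)}


def re_replication_finished(fsck_text):
    """True once fsck reports zero under-replicated blocks.

    Raises ValueError if the expected summary line is missing, rather than
    guessing that the cluster is healthy.
    """
    counts = replication_counts(fsck_text)
    if "Under-replicated" not in counts:
        raise ValueError("no 'Under-replicated blocks:' line in fsck output")
    return counts["Under-replicated"] == 0
```

In practice the text would come from running hadoop fsck / on the cluster; a non-zero under-replicated count just means the namenode is still copying blocks.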
Michel Segel 2012-06-25, 03:14
It isn't noticed any sooner; detection is governed by a timeout. You can reduce the timeout, it's configurable. The default is 10 min.
There shouldn't be downtime of the cluster, just the node.
Note this is for Apache. MapR is different and someone from MapR should be able to provide details...
Sent from a remote device. Please excuse any typos...
Mike Segel
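For reference, here is a rough sketch of where the roughly 10-minute figure comes from. It assumes, as far as I understand Hadoop 1.x, that the namenode declares a datanode dead after twice the recheck interval plus ten heartbeat intervals, with the recheck interval set by heartbeat.recheck.interval (milliseconds) and the heartbeat by dfs.heartbeat.interval (seconds). Treat the property names and the formula as assumptions to verify against your release; later versions rename the recheck property. The function name is hypothetical.

```python
def datanode_dead_timeout_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Approximate time before the namenode marks a silent datanode dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    The defaults used here (5 min recheck, 3 s heartbeat) are the values
    I believe ship with Hadoop 1.x; check your own configuration.
    """
    return 2 * (recheck_interval_ms / 1000.0) + 10 * heartbeat_interval_s


# With the assumed defaults: 2 * 300 s + 10 * 3 s = 630 s (about 10.5 min).
default_timeout = datanode_dead_timeout_seconds()

# Lowering the recheck interval to 1 minute: 2 * 60 s + 30 s = 150 s.
faster_timeout = datanode_dead_timeout_seconds(recheck_interval_ms=60_000)
```

This only covers the HDFS datanode side. A shorter timeout makes the namenode react sooner but also makes it more likely to start re-replicating after a brief network hiccup.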
M. C. Srivas 2012-07-09, 05:04
On Sun, Jun 24, 2012 at 8:14 PM, Michel Segel <[EMAIL PROTECTED]>wrote:
> You don't notice it faster, it's the timeout.
> You can reduce the timeout, it's configurable. Default is 10 min.
>
> There shouldn't be downtime of the cluster, just the node.
>
> Note this is for Apache. MapR is different and someone from MapR should be able to provide details...
No downtime for MapR ... the failed drive is detected in 30 seconds or so (if the controller is jammed, Linux takes about 2 mins to "un-hang" the entire system, so it could be as much as that). The drive can be pulled out and a new one inserted while the system is live. MapR will automatically reformat and start using the newly added drive in under 1 min.
While you are fetching the replacement drive, the data that was on the bad drive is immediately rebuilt and redistributed automatically.
Kevin O'dell 2012-07-09, 12:37
Depending on your setup (not MapR), you can also raise the number of allowed failed volumes; this will let you keep your nodes up until you are ready to replace the single bad drive.
-- Kevin O'Dell Customer Operations Engineer, Cloudera
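To illustrate the "allowed failed volumes" idea, here is a small sketch. It assumes the setting in question is dfs.datanode.failed.volumes.tolerated in hdfs-site.xml, that the datanode keeps running while the number of failed data volumes does not exceed that value, and that the value must be smaller than the number of configured data directories. These are my understanding of the behaviour, not something stated in the thread, and the function name is hypothetical.

```python
def datanode_stays_up(configured_volumes, failed_volumes, tolerated):
    """Model whether a datanode keeps running after some data disks fail.

    configured_volumes: number of data directories configured on the node
    failed_volumes:     how many of them have failed
    tolerated:          assumed value of dfs.datanode.failed.volumes.tolerated
    """
    # Assumption: the tolerated count must be at least 0 and strictly less
    # than the number of configured volumes, otherwise the setting is invalid.
    if not 0 <= tolerated < configured_volumes:
        raise ValueError("tolerated must be >= 0 and < configured_volumes")
    return failed_volumes <= tolerated


# Four data disks, one failure, default-style tolerance of 0: node goes down.
strict = datanode_stays_up(configured_volumes=4, failed_volumes=1, tolerated=0)

# Same failure with a tolerance of 1: node stays up until the disk is swapped.
relaxed = datanode_stays_up(configured_volumes=4, failed_volumes=1, tolerated=1)
```

This only helps when the failed disk holds data alone; if the root volume is on the same disk, as in the original question, the node goes down regardless.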