-
Intelligence of decommission
Allen Wittenauer 2009-08-28, 20:07
Hi.
As I sit here and wait for node decommission to finish, I was wondering about the intelligence of the decision making. [The name node's, not mine. :) ]
Let's say I have the following scenario:
I have two files. Both files consist of one block with a replication factor of three. I decommission two nodes. File #1 has two of its replicas on the two nodes I am decommissioning. File #2 has only one of its replicas on one of the two nodes I am decommissioning.
Is the block with two replicas on the two nodes I am decommissioning given priority? How does the name node decide which blocks to re-replicate first?
Thanks.
-
Re: Intelligence of decommission
Dhruba Borthakur 2009-08-28, 20:41
When a user issues the decommission command for a node, all the blocks that currently reside on that node are inserted into the to-be-replicated queue. Then the ReplicationMonitor inside the namenode starts replicating these blocks (during this period, the replica on the machine being decommissioned is used for reads, but is not considered a valid replica by the ReplicationMonitor).
hope this helps,
dhruba
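Applied to the two-file scenario from the original question, the rule that a replica on a decommissioning node does not count as valid can be sketched roughly as follows. This is only an illustration: the node names (dn1 through dn5) and the helper `count_valid_replicas` are hypothetical and are not taken from the namenode code.

```python
def count_valid_replicas(replica_nodes, decommissioning):
    """Count the replicas that live on nodes NOT being decommissioned.

    Hypothetical helper for illustration only; replicas on decommissioning
    nodes can still serve reads but are not counted here.
    """
    return sum(1 for node in replica_nodes if node not in decommissioning)


# Assumed cluster layout: dn1 and dn2 are the two nodes being decommissioned.
decommissioning = {"dn1", "dn2"}

# File #1: two of its three replicas sit on the decommissioning nodes.
file1_block = ["dn1", "dn2", "dn3"]
# File #2: only one of its three replicas sits on a decommissioning node.
file2_block = ["dn1", "dn4", "dn5"]

file1_valid = count_valid_replicas(file1_block, decommissioning)
file2_valid = count_valid_replicas(file2_block, decommissioning)

print(file1_valid, file2_valid)
```

Under these assumptions File #1's block is effectively down to a single valid replica while File #2's block still has two, which is why the ordering of the queue matters.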
-
Re: Intelligence of decommission
Allen Wittenauer 2009-08-28, 20:58
On 8/28/09 1:41 PM, "Dhruba Borthakur" <[EMAIL PROTECTED]> wrote:
> when a user issues the decommission command, all the blocks that are
> currently residing on it are inserted into the to-be-replicated queue. Then
> the ReplicationMonitor inside the namenode starts replicating these blocks
> (during this period, the replica on the machine being decommissioned is used
> for reads, but is not considered a valid replica by the ReplicationMonitor).
OK, so it sounds like there is no real ordering of the blocks in the to-be-replicated pool.
It follows then that even during normal operation, there is a scary edge case risk when a block is down to one replica and a decommission is triggered. While the name node is busy replicating blocks that can be fetched from multiple sources, any file that suddenly finds itself down to one replica of a block may end up corrupted if that single replica somehow gets lost (node crash, whatever).
I guess I'll file a JIRA to make replication smarter. There probably should be queues based on # of replicas vs. expected # of replicas. This way higher risk blocks are replicated first.
-
Re: Intelligence of decommission
Dhruba Borthakur 2009-08-28, 23:11
The ReplicationMonitor inserts the blocks-to-be-replicated in some sort of priority-queue based on the existing number of replicas of the block and their rack locations. So, blocks that have only one replica will get replicated sooner than other blocks.
thanks, dhruba
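A minimal sketch of the kind of ordering described above, assuming a simple priority scheme keyed on the number of valid replicas, the expected replication factor, and the number of racks holding a replica. The priority levels, the `priority` and `replication_order` functions, and the block names are all hypothetical; the actual levels used by the ReplicationMonitor may differ.

```python
import heapq


def priority(valid, expected, racks):
    """Return a hypothetical priority level; lower means more urgent."""
    if valid <= 1:
        return 0  # at most one valid replica left: highest risk
    if valid < expected:
        return 1  # under-replicated, but more than one source exists
    if racks < 2:
        return 2  # enough replicas, but all on a single rack
    return 3      # nothing urgent


def replication_order(blocks):
    """Pop blocks from a heap so the most at-risk block comes out first.

    Each block is (name, valid_replicas, expected_replicas, racks).
    The sequence number keeps insertion order among equal priorities.
    """
    heap = []
    for seq, (name, valid, expected, racks) in enumerate(blocks):
        heapq.heappush(heap, (priority(valid, expected, racks), seq, name))
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Assumed state after decommissioning starts (illustrative values only).
blocks = [
    ("file2_block", 2, 3, 2),       # two valid replicas of three
    ("file1_block", 1, 3, 1),       # one valid replica of three
    ("rack_bound_block", 3, 3, 1),  # fully replicated, single rack
]
order = replication_order(blocks)
print(order)
```

With this scheme the block that is down to one valid replica is handled before the one that still has two, regardless of the order in which they were queued.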