Re: Hang when add/remove a datanode into/from a 2 datanode cluster
sam liu 2013-07-02, 06:04
Yes, the default replication factor is 3. However, in my case it's
strange: while the decommission hangs, I found that some blocks' expected
replica count is 3, even though the 'dfs.replication' value in hdfs-site.xml
on every cluster node has been 2 since the cluster was set up. Below are my
steps:
1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and set
'dfs.replication' to 2 in hdfs-site.xml
2. Add node dn3 into the cluster as a new datanode, without changing the
'dfs.replication' value in hdfs-site.xml (it stays 2)
note: step 2 passed
3. Decommission dn3 from the cluster
Expected result: dn3 could be decommissioned successfully
Actual result:
a). The decommission progress hangs and the status is always 'Waiting
DataNode status: Decommissioned'. But if I execute 'hadoop dfs -setrep -R 2
/', the decommission continues and eventually completes.
b). However, if the initial cluster includes >= 3 datanodes, this issue is
not encountered when adding/removing another datanode. For example, if I
set up a cluster with 3 datanodes, I can successfully add a 4th datanode
to it, and then also successfully remove that 4th datanode from the
cluster.
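For reference, the replication setting described in step 1 is a single property in hdfs-site.xml; a minimal fragment (values as in the steps above) would look like:

```xml
<!-- hdfs-site.xml: default replication factor for files created by
     clients that load this configuration -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Note that this only sets a default for clients that actually read this config file; a client running with different (or default) configs can still create files with replication factor 3.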
I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for this. Any
comments are welcome.
2013/6/21 Harsh J <[EMAIL PROTECTED]>
> The dfs.replication is a per-file parameter. If you have a client that
> does not use the supplied configs, then its default replication is 3
> and all files it will create (as part of the app or via a job config)
> will be with replication factor 3.
> You can do an -lsr to find all files and filter which ones have been
> created with a factor of 3 (versus expected config of 2).
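Harsh's suggestion can be sketched as a shell filter. The listing lines below are hypothetical sample data; on a real cluster the input would come from `hadoop fs -lsr /` instead, where (in Hadoop 1.x) the second column of each file line is that file's replication factor:

```shell
# Hypothetical lsr-style listing; real input: hadoop fs -lsr /
# For file lines, column 2 is the replication factor.
listing='-rw-r--r--   3 hdfs supergroup   1024 2013-06-21 12:00 /user/hdfs/a.txt
-rw-r--r--   2 hdfs supergroup   2048 2013-06-21 12:01 /user/hdfs/b.txt'

# Print the paths of files created with replication factor 3
# (the unexpected ones, versus the configured default of 2).
printf '%s\n' "$listing" | awk '$2 == 3 {print $NF}'
```

Any path this prints is a file whose replication could then be lowered with `hadoop dfs -setrep 2 <path>`.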
> On Fri, Jun 21, 2013 at 3:13 PM, sam liu <[EMAIL PROTECTED]> wrote:
> > Hi George,
> > Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2, but I
> > still encounter this issue.
> > Thanks!
> > 2013/6/21 George Kousiouris <[EMAIL PROTECTED]>
> >> Hi,
> >> I think I have faced this before. The problem is that you have the rep
> >> factor=3, so it seems to hang because it needs 3 nodes to achieve the
> >> required replication (replicas are not created on the same node). If you
> >> set the replication factor=2, I think you will not have this issue. So in
> >> general you must make sure that the rep factor is <= the number of
> >> available datanodes.
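George's rule of thumb (each replica of a block must land on a distinct datanode, so the replication factor can never exceed the number of live datanodes) can be illustrated with a tiny shell sketch; the numbers are hypothetical:

```shell
# Toy check: a block reaches full replication only if every replica
# can be placed on a distinct live datanode, i.e. rep_factor <= live_nodes.
check_replication() {
  rep_factor=$1; live_nodes=$2
  if [ "$rep_factor" -le "$live_nodes" ]; then
    echo "ok: block can reach $rep_factor replicas"
  else
    echo "stuck: only $live_nodes nodes for $rep_factor replicas"
  fi
}

check_replication 3 2   # the hanging case from this thread
check_replication 2 2   # fine once replication is lowered to 2
```

In the hanging case, decommissioning dn3 leaves 2 live nodes holding blocks that expect 3 replicas, so the namenode waits forever for a placement that cannot happen.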
> >> BR,
> >> George
> >> On 6/21/2013 12:29 PM, sam liu wrote:
> >> Hi,
> >> I encountered an issue which hangs the decommission operation. The steps
> >> are:
> >> 1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2. And, in
> >> hdfs-site.xml, set the 'dfs.replication' to 2
> >> 2. Add node dn3 into the cluster as a new datanode, without changing the
> >> 'dfs.replication' value in hdfs-site.xml (keep it as 2)
> >> note: step 2 passed
> >> 3. Decommission dn3 from the cluster
> >> Expected result: dn3 could be decommissioned successfully
> >> Actual result: the decommission progress hangs and the status is always
> >> 'Waiting DataNode status: Decommissioned'
> >> However, if the initial cluster includes >= 3 datanodes, this issue won't
> >> be encountered when adding/removing another datanode.
> >> Also, after step 2, I noticed that some blocks' expected replica count is
> >> 3, but the 'dfs.replication' value in hdfs-site.xml is always 2!
> >> Could anyone please help triage this?
> >> Thanks in advance!
> >> --
> >> ---------------------------
> >> George Kousiouris, PhD
> >> Electrical and Computer Engineer
> >> Division of Communications,
> >> Electronics and Information Engineering
> >> School of Electrical and Computer Engineering
> >> Tel: +30 210 772 2546
> >> Mobile: +30 6939354121
> >> Fax: +30 210 772 2569
> >> Email: [EMAIL PROTECTED]
> >> Site: http://users.ntua.gr/gkousiou/
> >> National Technical University of Athens
> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece