Re: Hang when add/remove a datanode into/from a 2 datanode cluster
Yes, you are correct: using the fsck tool I found that some files in my
cluster expect more replicas than the value defined in dfs.replication. If I
set the expected replication of these files to a suitable number, the
decommissioning process goes smoothly and the datanode is finally
decommissioned.
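For reference, the two commands involved look roughly like this (a sketch, not a recipe: the exact fsck output wording varies between Hadoop versions, and the target replication of 2 comes from this cluster's dfs.replication):

```shell
# Inspect per-file replication; over-replicated files show a target
# "repl=" higher than the remaining datanode count can satisfy.
hadoop fsck / -files -blocks

# Lower the replication factor of everything under / to 2 so that a
# 2-datanode cluster can satisfy it and decommissioning can finish.
hadoop dfs -setrep -R 2 /
```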

However, many users may not be aware of this and could be confused when the
cluster stays in the decommissioning phase forever. So I think we could make
some improvements that let users more easily run a precheck before
decommissioning a datanode, and help them find all files which might lack
replicas after the datanode is removed. For example, we could tell the user
that the expected replication of file1 and file26 is 6, but that after
decommissioning a datanode the cluster will only have 5 datanodes and will
no longer satisfy file1 and file26. In this way, the user can decide whether
to continue the decommissioning or to reduce the expected replication of
those files. As for the implementation, I think we could add a
decommission-precheck script or a parameter to the fsck tool.
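As a rough sketch of what such a precheck could do, the following filters fsck-style listings for files whose expected replication exceeds the datanode count that would remain after decommissioning. (Hypothetical: the sample lines below only imitate the shape of `hadoop fsck / -files` output; the real format differs across versions, so a real script would need to parse the actual output.)

```shell
# Datanodes that will remain after the planned decommission.
remaining_datanodes=5

# Sample lines shaped like 'hadoop fsck / -files' output (assumption:
# real output wording differs between Hadoop versions).
fsck_output='/user/file1 100 bytes, 1 block(s): repl=6
/user/file2 200 bytes, 1 block(s): repl=2
/user/file26 300 bytes, 1 block(s): repl=6'

# Print every file whose expected replication cannot be satisfied
# by the remaining datanodes.
printf '%s\n' "$fsck_output" \
  | awk -v dn="$remaining_datanodes" '
      match($0, /repl=[0-9]+/) {
        r = substr($0, RSTART + 5, RLENGTH - 5) + 0
        if (r > dn) print $1 " (repl=" r ")"
      }'
```

Run against real fsck output, this would flag file1 and file26 but not file2, letting the user fix replication before starting the decommission.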

Any comments?
2013/8/1 Harsh J <[EMAIL PROTECTED]>

> As I said before, it is a per-file property and the config can be
> bypassed by clients that do not read the configs, place a manual API
> override, etc..
>
> If you want to really define a hard maximum and catch such clients,
> try setting dfs.replication.max to 2 at your NameNode.
>
> On Thu, Aug 1, 2013 at 8:07 AM, sam liu <[EMAIL PROTECTED]> wrote:
> > But, please mention that the value of 'dfs.replication' of the cluster is
> > always 2, even when the datanode number is 3. And I am pretty sure I did
> > not manually create any files with rep=3. So, why were some files of hdfs
> > created with repl=3, but not repl=2?
> >
> >
> > 2013/8/1 Harsh J <[EMAIL PROTECTED]>
> >>
> >> The step (a) points to your problem and solution both. You have files
> >> being created with repl=3 on a 2 DN cluster which will prevent
> >> decommission. This is not a bug.
> >>
> >> > On Wed, Jul 31, 2013 at 12:09 PM, sam liu <[EMAIL PROTECTED]> wrote:
> >> > I opened a jira for tracking this issue:
> >> > https://issues.apache.org/jira/browse/HDFS-5046
> >> >
> >> >
> >> > 2013/7/2 sam liu <[EMAIL PROTECTED]>
> >> >>
> >> >> Yes, the default replication factor is 3. However, in my case, it's
> >> >> strange: while the decommission hangs, I found that some blocks'
> >> >> expected replicas is 3, but the 'dfs.replication' value in
> >> >> hdfs-site.xml of every cluster node has always been 2 since the
> >> >> beginning of cluster setup. Below are my steps:
> >> >>
> >> >> 1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2.
> >> >> And, in hdfs-site.xml, set the 'dfs.replication' to 2
> >> >> 2. Add node dn3 into the cluster as a new datanode, and do not change
> >> >> the 'dfs.replication' value in hdfs-site.xml; keep it as 2
> >> >> note: step 2 passed
> >> >> 3. Decommission dn3 from the cluster
> >> >> Expected result: dn3 could be decommissioned successfully
> >> >> Actual result:
> >> >> a). The decommission progress hangs and the status always stays
> >> >> 'Waiting DataNode status: Decommissioned'. But, if I execute 'hadoop
> >> >> dfs -setrep -R 2 /', the decommission continues and completes
> >> >> finally.
> >> >> b). However, if the initial cluster includes >= 3 datanodes, this
> >> >> issue is not encountered when adding/removing another datanode. For
> >> >> example, if I set up a cluster with 3 datanodes, I can successfully
> >> >> add a 4th datanode into it, and then also successfully remove the 4th
> >> >> datanode from the cluster.
> >> >>
> >> >> I suspect it's a bug and plan to open a jira against Hadoop HDFS for
> >> >> this. Any comments?
> >> >>
> >> >> Thanks!
> >> >>
> >> >>
> >> >> 2013/6/21 Harsh J <[EMAIL PROTECTED]>
> >> >>>
> >> >>> The dfs.replication is a per-file parameter. If you have a client