-
Is that a good practice?
Mark Kerzner 2011-03-03, 23:44
Hi,
in my small development cluster I have a master/slave node and a slave node, and I shut down the slave node at night. I often see that my HDFS is corrupted, and I have to reformat the name node and delete the data directory.
It finally dawns on me that with such a small cluster I had better shut the daemons down, for otherwise they try too hard to compensate for the missing node and eventually it goes bad. Is my understanding correct?
Thank you, Mark
-
Re: Is that a good practice?
Eric Sammer 2011-03-03, 23:55
On Thu, Mar 3, 2011 at 6:44 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
> Hi,
>
> in my small development cluster I have a master/slave node and a slave node,
> and I shut down the slave node at night. I often see that my HDFS is
> corrupted, and I have to reformat the name node and to delete the data
> directory.
Why do you shut down the slave at night? HDFS should only be corrupted if you're missing all copies of a block. With a replication factor of 3 (default) you should have 100% of the data on both nodes (if you only have 2 nodes). If you've dialed it down to 1, simply starting the slave back up should "un-corrupt" HDFS. You definitely don't want to be doing this to HDFS regularly (dropping nodes from the cluster and re-adding them) unless you're trying to test HDFS' failure semantics.
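A toy model of the point above, not HDFS code: blocks are spread over two nodes, and a block counts as missing only when no live node holds a replica. The node names, block IDs, and round-robin placement are assumptions made for illustration; real HDFS placement differs, but it likewise keeps at most one replica of a block per DataNode.

```python
def place(blocks, nodes, replication):
    """Toy round-robin placement; replicas are capped at the node count."""
    copies = min(replication, len(nodes))
    return {
        block: [nodes[(i + k) % len(nodes)] for k in range(copies)]
        for i, block in enumerate(blocks)
    }


def missing_blocks(placement, live_nodes):
    """Return the blocks that have no replica on any live node."""
    live = set(live_nodes)
    return sorted(b for b, holders in placement.items() if not live & set(holders))


blocks = ["blk_1", "blk_2", "blk_3", "blk_4"]  # hypothetical block IDs
nodes = ["master", "slave"]                    # hypothetical node names

rep1 = place(blocks, nodes, replication=1)
rep3 = place(blocks, nodes, replication=3)

# Slave powered off for the night: only "master" is live.
print("rep=1, slave down:", missing_blocks(rep1, ["master"]))
print("rep=3, slave down:", missing_blocks(rep3, ["master"]))
# Slave started again: every block is reachable once more.
print("rep=1, slave back:", missing_blocks(rep1, nodes))
```

With a single replica, the blocks that live only on the stopped node show up as missing until it returns; with two effective replicas nothing is lost while one node is down.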
> It finally dawns on me that with such small cluster I better shut the
> daemons down, for otherwise they are trying too hard to compensate for the
> missing node and eventually it goes bad. Is my understanding correct?
It doesn't "eventually go bad." If the NN sees a DN disappear it may start re-replicating data to another node. In such a small cluster, maybe there's nowhere else to get the blocks from, but I bet you dialed the replication factor down to 1 (or have code that writes files with a rep factor of 1, like teragen / terasort).
In short, if you're going to shut down nodes like this, put the NN into safe mode first so it doesn't freak out (which will also make the cluster unusable during that time), but there's definitely no need to be reformatting HDFS. Just re-introduce the DN you shut down to the cluster.

> Thank you,
> Mark
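The safe-mode advice can be sketched as a small wrapper around the `hadoop dfsadmin -safemode` subcommand, which accepts `enter`, `leave`, `get`, and `wait`. The helper names and the dry-run default are assumptions for illustration; with `dry_run=True` the sketch only prints the command, so it can be tried on a machine without Hadoop installed.

```python
import subprocess

SAFEMODE_ACTIONS = ("enter", "leave", "get", "wait")


def safemode_command(action):
    """Build the argv list for `hadoop dfsadmin -safemode <action>`."""
    if action not in SAFEMODE_ACTIONS:
        raise ValueError("unknown safemode action: %r" % (action,))
    return ["hadoop", "dfsadmin", "-safemode", action]


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


# Before powering off the slave for the night:
run(safemode_command("enter"))
# Next morning, once the DataNode has been started again:
run(safemode_command("get"))
run(safemode_command("leave"))
```

While the NameNode is in safe mode it does not re-replicate or delete blocks and the namespace is read-only, which is why the cluster is unusable for writes during that time.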
--
Eric Sammer
twitter: esammer
data: www.cloudera.com
-
Re: Is that a good practice?
Mark Kerzner 2011-03-04, 00:02
Eric,
I shut it down at night because the slave server is in my bedroom, and I use a replication factor of 1 because that is what my CDH install did, so I accepted it. I will bump it up to 3.
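A rough sketch of what bumping the replication factor involves, under the assumption that existing files should be changed too: the `dfs.replication` property only affects files written afterwards, while `hadoop fs -setrep` changes files already in HDFS. The helper names are hypothetical. Since HDFS stores at most one replica of a block per DataNode, a two-node cluster can hold only two copies even with a target of 3.

```python
def setrep_command(replication, path="/", recursive=True):
    """Build the argv list for `hadoop fs -setrep [-R] <rep> <path>`."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    cmd = ["hadoop", "fs", "-setrep"]
    if recursive:
        cmd.append("-R")
    cmd += [str(replication), path]
    return cmd


def replicas_possible(target, datanodes):
    """At most one replica of a block is stored per DataNode."""
    return min(target, datanodes)


print(" ".join(setrep_command(3)))
print("replicas achievable on 2 DataNodes:", replicas_possible(3, 2))
```

On a two-node cluster the remaining replica would be reported as under-replicated rather than corrupt, so a target of 2 is enough to survive one node being off.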
But the most important advice that you give is "put it into safe mode" - and that is what I am going to do all the time that I am not working on it, because it is purely my development cluster. I might even shut the daemons down completely.
Thank you, Mark
-
Re: Is that a good practice?
Harsh J 2011-03-04, 04:05
On Fri, Mar 4, 2011 at 5:14 AM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
> It finally dawns on me that with such small cluster I better shut the
> daemons down, for otherwise they are trying too hard to compensate for the
> missing node and eventually it goes bad. Is my understanding correct?
Are your dfs.data.dir and/or dfs.name.dir properties pointing to locations on /tmp, by the way? This appears to me like the simple case of your OS clearing out your /tmp on boot. You will lose all data + fsimage this way if you haven't configured your dfs.data.dir and dfs.name.dir to not be located on /tmp.
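One way to check for this is to scan the configuration for `dfs.name.dir` and `dfs.data.dir` values under /tmp. The sketch below parses an inline sample rather than a real file; the sample paths and the function name are made up for illustration. It does not resolve `${hadoop.tmp.dir}` references, and it does not flag properties that are left unset, which fall back to defaults under `hadoop.tmp.dir` (itself under /tmp by default).

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/tmp/dfs/data</value>
  </property>
</configuration>"""


def dirs_under_tmp(xml_text, keys=("dfs.name.dir", "dfs.data.dir")):
    """Map each storage property to its configured paths that sit under /tmp."""
    risky = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name not in keys:
            continue
        # Both properties accept a comma-separated list of directories.
        value = prop.findtext("value") or ""
        paths = [p.strip() for p in value.split(",") if p.strip()]
        bad = [p for p in paths if p == "/tmp" or p.startswith("/tmp/")]
        if bad:
            risky[name] = bad
    return risky


for name, paths in dirs_under_tmp(SAMPLE).items():
    print(name, "->", ", ".join(paths))
```

To check a real installation, the same function could be given the text of the site configuration file, whose location depends on the distribution.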
--
Harsh J
www.harshj.com
-
Re: Is that a good practice?
Mark Kerzner 2011-03-04, 04:32
Harsh,
Indeed, this bit me a while back, but now the default Cloudera distribution configures them outside of /tmp.
Really, as Eric has pointed out, I was making failure a regular occurrence by bringing one computer down.
Thank you, Mark