-
Question regarding datanode been wiped by hadoop
felix gao 2011-04-12, 14:46
What reason or condition would cause a datanode's blocks to be removed? One of the datanodes in our cluster crashed because of bad RAM. After the system was upgraded and the datanode/tasktracker was brought back online the next day, we noticed that the amount of space utilized was minimal and the cluster was rebalancing blocks to the datanode. It would seem the prior blocks were removed. Was this because the datanode was declared dead? What are the criteria for the namenode (assuming it is the namenode) to decide when a datanode should remove prior blocks?
-
Re: Question regarding datanode been wiped by hadoop
Ayon Sinha 2011-04-12, 14:52
The datanode process uses the dfs config XML file to determine which disks are available for storage. Can you check that the config XML lists all the partitions and was not overwritten during the restore process?
-Ayon
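A minimal sketch of the check suggested above, assuming the usual Hadoop config layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>` children) and a comma-separated directory list in `dfs.data.dir`. The helper names `read_property`, `missing_dirs`, and `demo` are hypothetical, and the file name `hdfs-site.xml` is only an assumption about where the property lives on a given install.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_property(conf_path, name):
    """Return a comma-separated Hadoop property value as a list ([] if absent)."""
    root = ET.parse(conf_path).getroot()
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value") or ""
            return [p.strip() for p in value.split(",") if p.strip()]
    return []


def missing_dirs(paths):
    """Return the configured paths that are not directories on this host."""
    return [p for p in paths if not os.path.isdir(p)]


def demo():
    """Build a throwaway config with one real and one missing directory."""
    with tempfile.TemporaryDirectory() as tmp:
        present = os.path.join(tmp, "disk1")
        absent = os.path.join(tmp, "disk2")
        os.mkdir(present)
        conf = os.path.join(tmp, "hdfs-site.xml")  # assumed file name
        with open(conf, "w") as fh:
            fh.write(
                "<configuration><property>"
                "<name>dfs.data.dir</name>"
                f"<value>{present},{absent}</value>"
                "</property></configuration>"
            )
        dirs = read_property(conf, "dfs.data.dir")
        return dirs, missing_dirs(dirs), present, absent


if __name__ == "__main__":
    dirs, missing, _, _ = demo()
    print("configured:", dirs)
    print("missing on disk:", missing)
```

On a real datanode you would point `read_property` at the actual config file and compare the result against the mounted partitions.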
-
Re: Question regarding datanode been wiped by hadoop
felix gao 2011-04-12, 15:02
The XML files have not been changed for more than two months, so that should not be the reason. Even the in_use.lock is more than a month old. However, we did shut the node down a few days ago and restarted it afterward, so the second shutdown might not have been clean.
-
Re: Question regarding datanode been wiped by hadoop
felix gao 2011-04-12, 15:05
Judging by the timestamps, the only directory that seems to have been modified and emptied is the current directory under dfs.home.dir. However, the storage file under dfs.home.dir has been untouched since the datanode started.
-
Re: Question regarding datanode been wiped by hadoop
Ayon Sinha 2011-04-12, 15:11
If you've only lost a few partitions on a datanode and have not lost any complete files, thanks to replicated blocks, then I'd wipe the dfs.data.dir partitions and rebalance. It can be time-consuming to find the exact reason why the data blocks got removed.
-Ayon
-
Re: Question regarding datanode been wiped by hadoop
Marcos Ortiz 2011-04-12, 16:13
1- Did you check the DataNode's logs?
2- Did you protect the NameNode's dfs.name.dir and dfs.edits.dir directories? The first is where the NameNode stores the file system image, and the second is where the edit log (journal) is written. A good practice is to keep these directories on RAID 1 or RAID 10 to guarantee the consistency of your cluster. Any data loss in these directories (dfs.name.dir and dfs.edits.dir) will result in a loss of data in your HDFS. So the second good practice is to set up a secondary NameNode in case the primary NameNode fails.
Another thing to keep in mind is that when the NameNode fails, you have to restart the JobTracker and the TaskTrackers after the NameNode has been restarted.
Regards
--
Marcos Luís Ortíz Valmaseda
Software Engineer (Large-Scaled Distributed Systems)
University of Information Sciences, La Habana, Cuba
Linux User # 418229
-
Re: Question regarding datanode been wiped by hadoop
Harsh J 2011-04-12, 16:17
Apart from checking everything the others here have suggested: are your mapred.local.dir and dfs.data.dir pointing to the same directory by any chance? If so, the tasktracker could potentially wipe out the datanode directories when it is restarted.
--
Harsh J
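A rough sketch of the overlap check described above, assuming both properties have already been read as lists of absolute POSIX paths. The function name `overlapping_dirs` is hypothetical. It flags pairs that are identical or nested, since a nested layout could be affected by a cleanup of the parent as well.

```python
import posixpath


def overlapping_dirs(dirs_a, dirs_b):
    """Return (a, b) pairs where one directory equals or contains the other.

    Both inputs are assumed to be lists of absolute POSIX paths.
    """
    pairs = []
    for a in dirs_a:
        for b in dirs_b:
            na, nb = posixpath.normpath(a), posixpath.normpath(b)
            # If the common ancestor is one of the two paths, they overlap.
            if posixpath.commonpath([na, nb]) in (na, nb):
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    mapred_local = ["/hadoop/mapred"]  # values quoted later in this thread
    dfs_data = ["/hadoop/dfs"]
    print(overlapping_dirs(mapred_local, dfs_data))
```

With the two paths reported later in the thread, the check finds no overlap, which is consistent with this cause being ruled out.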
-
Re: Question regarding datanode been wiped by hadoop
felix gao 2011-04-12, 16:30
mapred.local.dir is under /hadoop/mapred and dfs.data.dir is /hadoop/dfs. The logs show:
2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(had41.xxx:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(69.170.130.173:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/hadoop/dfs/current'}
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 110651 blocks got processed in 4493 msecs
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner.
2011-04-11 14:34:15,692 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_-9213431071914219029_3395500 file /hadoop/dfs/current/subdir7/subdir26/blk_-9213431071914219029
This is followed by deletions of all the other blocks in the dfs/current directory. It seems to me that Hadoop just wants to invalidate all the blocks on that box by deleting all of them.
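To put numbers on an excerpt like the one above, a small script can compare the block count from the initial block report with the number of distinct blocks deleted afterwards. This is only a sketch: the two regular expressions are written against the exact log wording quoted in this message, other Hadoop versions may phrase these lines differently, and `summarize_datanode_log` is a hypothetical helper name.

```python
import re

# Patterns taken from the datanode log lines quoted in this thread.
REPORT_RE = re.compile(r"BlockReport of (\d+) blocks")
DELETE_RE = re.compile(r"Deleting block (blk_-?\d+_\d+)")


def summarize_datanode_log(lines):
    """Return (blocks in the last block report, number of distinct deleted blocks)."""
    reported = None
    deleted = set()
    for line in lines:
        match = REPORT_RE.search(line)
        if match:
            reported = int(match.group(1))
        match = DELETE_RE.search(line)
        if match:
            deleted.add(match.group(1))
    return reported, len(deleted)


if __name__ == "__main__":
    sample = [
        "2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
        "BlockReport of 110651 blocks got processed in 4493 msecs",
        "2011-04-11 14:34:15,692 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
        "Deleting block blk_-9213431071914219029_3395500 file "
        "/hadoop/dfs/current/subdir7/subdir26/blk_-9213431071914219029",
    ]
    print(summarize_datanode_log(sample))
```

If the deleted count on the full log approaches the reported count, the datanode was asked to drop essentially everything it had just reported.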
-
Re: Question regarding datanode been wiped by hadoop
Matthew Foley 2011-04-12, 17:09
Here's another hypothesis. You'll have to check your namenode logs to see if it is the correct one.

In a healthy cluster, when one datanode goes down, 10 minutes later the namenode will "notice" it and mark all its blocks as under-replicated (in namenode memory). It will then generate replication requests to other holders of those blocks, to get back to the required replication count for each. In a large cluster, this process can take as little as a few minutes, because there are many candidate senders and receivers for the copy operations.

When the datanode comes back a whole day later and sends its Initial Block Report, which we'll assume still has all the blocks it did before (since the disks were not corrupted), all those blocks will be marked as OVER-replicated. The namenode will generate delete requests for one replica of each of those blocks.

Now, I would have expected those delete requests to be randomly distributed across all nodes holding replicas of those blocks. But there is some indication in the code that all the deletes may go to the new excess source, especially if the number of blocks is small (say, under one thousand). I'm not sure whether this is a sufficient explanation or not.

If the block deletes were ordered by the namenode, they may be in the namenode logs, with a prefix "hdfs.StateChange: BLOCK*", like this:
11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (127.0.0.1:49946, blk_-44847565016350316_1001) is added to recentInvalidateSets
11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* ask 127.0.0.1:49946 to delete blk_-44847565016350316_1001

Obviously you have to be logging at INFO level for these block-level manipulations to be in your namenode logs. The datanode then echoes it with log lines like:
11/04/12 10:00:29 INFO datanode.DataNode: Scheduling block blk_-44847565016350316_1001 file .../current/finalized/blk_-44847565016350316 for deletion
11/04/12 10:00:29 INFO datanode.DataNode: Deleted block blk_-44847565016350316_1001 at file .../current/finalized/blk_-44847565016350316

Your log extract shows that the datanode was logging at INFO level. See if your namenode was too, and whether you can show that it generated a bunch of delete requests shortly after your repaired datanode came up.

Cheers,
--Matt
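One way to test this hypothesis is to count the namenode's delete requests per datanode. The sketch below matches only the "BLOCK* ask ... to delete ..." wording quoted in this message. It assumes that a single request line may list several block IDs separated by whitespace, which is an assumption and not something this thread confirms, and `delete_requests_per_datanode` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern based on the namenode log line quoted in this thread.
ASK_DELETE_RE = re.compile(r"BLOCK\* ask (\S+) to delete (.+)$")


def delete_requests_per_datanode(lines):
    """Count the blocks each datanode was asked to delete."""
    counts = Counter()
    for line in lines:
        match = ASK_DELETE_RE.search(line.rstrip("\n"))
        if match:
            datanode, blocks = match.group(1), match.group(2).split()
            counts[datanode] += len(blocks)
    return counts


if __name__ == "__main__":
    sample = [
        "11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: "
        "(127.0.0.1:49946, blk_-44847565016350316_1001) is added to recentInvalidateSets",
        "11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* ask 127.0.0.1:49946 "
        "to delete blk_-44847565016350316_1001",
    ]
    print(delete_requests_per_datanode(sample))
```

A count for the repaired datanode that is close to its reported block total, clustered shortly after it rejoined, would support the over-replication explanation.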
-
Re: Question regarding datanode been wiped by hadoop
Boris Shkolnik 2011-04-12, 17:20
One thing to consider: if the node was down for a day, all of its blocks could have been replicated to other datanodes. When the machine is brought back, these blocks become over-replicated and the NameNode decides to delete them. You should check the logs of both the DataNode and the NameNode to see if that could be the case.
Boris.
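The sequence described here can be illustrated with a toy model. This is not HDFS's actual placement or deletion policy, only a sketch of the bookkeeping: a node drops out, missing replicas are re-created elsewhere, the node returns with its old replicas, and some blocks end up above the target replication. The names `simulate_outage` and `excess_replicas`, the node labels, and the replication factor of 3 are all illustrative assumptions.

```python
def simulate_outage(block_locations, down_node, all_nodes, replication=3):
    """Toy model: return block -> holders after an outage and the node's return."""
    after = {}
    for blk, holders in block_locations.items():
        live = [n for n in holders if n != down_node]
        # While the node is dead, re-replicate onto live nodes lacking the block.
        candidates = [n for n in all_nodes if n != down_node and n not in live]
        while len(live) < replication and candidates:
            live.append(candidates.pop(0))
        # The node comes back and reports the replica it still has on disk.
        if down_node in holders:
            live.append(down_node)
        after[blk] = live
    return after


def excess_replicas(block_locations, replication=3):
    """Return block -> number of replicas above the target replication."""
    return {
        blk: len(nodes) - replication
        for blk, nodes in block_locations.items()
        if len(nodes) > replication
    }


if __name__ == "__main__":
    before = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
    after = simulate_outage(before, "dn1", ["dn1", "dn2", "dn3", "dn4", "dn5"])
    print(after)
    print(excess_replicas(after))
```

In this model every block the returning node held becomes over-replicated by one. Which replica the namenode then chooses to delete is exactly the open question in this thread.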