Robert Dyer 2013-02-16, 20:38
I am at my wits' end here. Every single time I restart the namenode, I get this crash:
2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
I am following best practices here, as far as I know. I have the namenode writing into 3 directories (2 local, 1 NFS). All 3 of these dirs have the exact same files in them.
I also run a secondary checkpoint node. It appears to have started failing a week ago, so checkpoints have *not* been done since then. Thus I can get the NN up and running, but only with week-old data!
What is going on here? Why does my NN data *always* wind up causing this exception over time? Is there some easy way to get notified when the checkpointing starts to fail?
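On the notification question, one possible approach is sketched below: a small script, run from cron or a monitoring agent, that flags the checkpoint as stale when the image file has not been rewritten recently. This is only a sketch under assumptions, not a built-in Hadoop feature. It assumes the checkpointed image lives at `<name dir>/current/fsimage`, as in the Hadoop 1.x storage layout, and that a successful checkpoint updates that file's modification time. The function names, the path, and the threshold are all hypothetical and should be adapted to the cluster.

```python
import os
import time


def checkpoint_age_seconds(image_path, now=None):
    """Return the number of seconds since image_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(image_path)


def is_checkpoint_stale(image_path, max_age_seconds, now=None):
    """Return True if the image file is missing or older than max_age_seconds.

    Assumption: a successful checkpoint rewrites the image file, so an old
    modification time suggests that checkpointing has stopped.
    """
    try:
        return checkpoint_age_seconds(image_path, now) > max_age_seconds
    except OSError:
        # A missing or unreadable image is treated as a failed checkpoint.
        return True
```

A cron job could call `is_checkpoint_stale("/path/to/name/dir/current/fsimage", 3 * 3600)` (placeholder path, threshold set to a few checkpoint periods) and send mail or exit non-zero when it returns True.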
Robert Dyer 2013-02-16, 20:39
Forgot to mention: Hadoop 1.0.4
--
Robert Dyer [EMAIL PROTECTED]