HDFS >> mail # user >> Re: Namenode failures


Re: Namenode failures
It just happened again.  This was after a fresh format of HDFS/HBase and I
am attempting to re-import the (backed up) data.

  http://pastebin.com/3fsWCNQY

So now if I restart the namenode, I will lose data from the past 3 hours.

What is causing this?  How can I avoid it in the future?  Is there an easy
way to monitor the checkpoints (other than a script grepping the logs) to
see when this happens?
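For what it's worth, a minimal checkpoint-freshness check could look like the sketch below. It assumes the checkpoint directory path and a 90-minute staleness threshold, both hypothetical; it simply alerts when the newest fsimage in a directory is older than the threshold (or missing entirely):

```shell
#!/bin/sh
# check_checkpoint_age DIR [MAX_AGE_MIN]
# Prints OK if DIR contains an fsimage modified within MAX_AGE_MIN minutes
# (default 90), STALE otherwise.  DIR would typically be the
# fs.checkpoint.dir "current" directory -- path here is hypothetical.
check_checkpoint_age() {
  dir="$1"
  max_min="${2:-90}"
  if [ ! -e "$dir/fsimage" ]; then
    echo "STALE: no fsimage found in $dir"
    return 1
  fi
  # find prints the file only if it is OLDER than max_min minutes
  if [ -n "$(find "$dir" -name fsimage -mmin +"$max_min")" ]; then
    echo "STALE: fsimage in $dir older than $max_min minutes"
    return 1
  fi
  echo "OK: fsimage in $dir is recent"
}
```

Run from cron and wired to mail/Nagios, a non-zero exit would flag the exact situation described above (checkpoints silently failing for a week) within one checkpoint period.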
On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <[EMAIL PROTECTED]> wrote:

> Forgot to mention: Hadoop 1.0.4
>
>
> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <[EMAIL PROTECTED]> wrote:
>
>> I am a bit at my wits' end here.  Every single time I restart the
>> namenode, I get this crash:
>>
>> 2013-02-16 14:32:42,616 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>> loaded in 0 seconds.
>> 2013-02-16 14:32:42,618 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NullPointerException
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> I am following best practices here, as far as I know.  I have the
>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>> have the exact same files in them.
>>
>> I also run a secondary checkpoint node.  It appears to have started
>> failing a week ago, so checkpoints have *not* been done since then.  Thus
>> I can get the NN up and running, but with week-old data!
>>
>> What is going on here?  Why does my NN data *always* wind up causing
>> this exception over time?  Is there some easy way to get notified when the
>> checkpointing starts to fail?
>>
>
>
>
> --
>
> Robert Dyer
> [EMAIL PROTECTED]
>
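For reference, the three-directory layout I described in the quoted message corresponds roughly to an hdfs-site.xml like this on Hadoop 1.x (the paths are substitutes, not my real ones):

```xml
<!-- hdfs-site.xml sketch; paths are hypothetical -->
<property>
  <name>dfs.name.dir</name>
  <!-- two local dirs plus one NFS mount, comma-separated -->
  <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data1/dfs/namesecondary</value>
</property>
```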

--

Robert Dyer
[EMAIL PROTECTED]