Re: Namenode failures
It just happened again.  This was after a fresh format of HDFS/HBase and I
am attempting to re-import the (backed up) data.

  http://pastebin.com/3fsWCNQY

So now if I restart the namenode, I will lose data from the past 3 hours.

What is causing this?  How can I avoid it in the future?  Is there an easy
way to monitor the checkpoints (other than a script grepping the logs) to
see when this happens?
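
One low-tech way to watch for this: in Hadoop 1.x the secondary namenode
writes each successful checkpoint into fs.checkpoint.dir, so a cron job that
alerts when the fsimage there goes stale will catch a dead checkpointer long
before a restart does.  A minimal sketch, assuming a placeholder path for
your fs.checkpoint.dir:

  #!/usr/bin/env python
  # check_checkpoint.py - exit nonzero if the last checkpoint looks stale.
  import os
  import sys
  import time

  CHECKPOINT_DIR = "/data/checkpoint/current"  # placeholder fs.checkpoint.dir
  MAX_AGE_SECS = 2 * 60 * 60                   # complain after 2h w/o checkpoint

  fsimage = os.path.join(CHECKPOINT_DIR, "fsimage")
  if not os.path.isfile(fsimage):
      sys.exit("no fsimage found under %s" % CHECKPOINT_DIR)
  age = time.time() - os.path.getmtime(fsimage)
  if age > MAX_AGE_SECS:
      sys.exit("last checkpoint was %d minutes ago" % (age / 60))
  print("last checkpoint %d minutes ago" % (age / 60))

Run it from cron every few minutes; sys.exit() with a message prints to
stderr and returns 1, so cron mails the output (or a Nagios-style check can
wrap it).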
On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <[EMAIL PROTECTED]> wrote:

> Forgot to mention: Hadoop 1.0.4
>
>
> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <[EMAIL PROTECTED]> wrote:
>
>> I am just about at my wits' end here.  Every single time I restart the
>> namenode, I get this crash:
>>
>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> I am following best practices here, as far as I know.  I have the
>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>> have the exact same files in them.
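
For reference, that layout is what a comma-separated dfs.name.dir in
hdfs-site.xml gives you; the paths below are placeholders:

  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
  </property>

The namenode writes its image and edit log to every listed directory, so any
one intact copy should be enough to bring the others back.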
>>
>> I also run a secondary checkpoint node.  It appears to have started
>> failing a week ago, so checkpoints have *not* been taken since then.  Thus
>> I can get the NN up and running, but with week-old data!
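
For what it's worth: as long as that week-old checkpoint in fs.checkpoint.dir
is intact, Hadoop 1.x can seed empty name directories from it by starting the
namenode with the -importCheckpoint option:

  hadoop namenode -importCheckpoint

It reads the image from fs.checkpoint.dir and saves it into dfs.name.dir, and
it will refuse to run if dfs.name.dir already contains a valid image.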
>>
>> What is going on here?  Why does my NN data *always* wind up causing
>> this exception over time?  Is there some easy way to get notified when the
>> checkpointing starts to fail?
>>
>
>
>
> --
>
> Robert Dyer
> [EMAIL PROTECTED]
>

--

Robert Dyer
[EMAIL PROTECTED]