The problem with the checkpoint node/2NN is that it happily "runs" with no
outward indication that it is unable to connect.
Because you have a large edits file, your startup will complete; however,
with a file that size it could take hours. The NameNode logs nothing while
this is going on, but as long as the CPU is busy it is making progress.
We have a Nagios check on the size of this directory, so if edit rolling
stops we know about it.
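A check like the one described above could be sketched roughly as follows. This is only an illustration, not our actual Nagios plugin: the `check_edits_size` helper name, the directory path, and the threshold are all assumptions.

```shell
#!/bin/sh
# Rough sketch of a Nagios-style check: alert if the NameNode metadata
# directory grows past a threshold, which would suggest edit rolling
# has stalled. Helper name, path, and threshold are illustrative only.
check_edits_size() {
    dir="$1"
    threshold_kb="$2"
    # Total on-disk size of the directory, in KB.
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    if [ "$used_kb" -gt "$threshold_kb" ]; then
        # Nagios convention: exit/return 2 means CRITICAL.
        echo "CRITICAL: $dir is ${used_kb} KB (edit roll may have stalled)"
        return 2
    fi
    echo "OK: $dir is ${used_kb} KB"
    return 0
}

# Example run against a scratch directory standing in for the real
# name directory (e.g. /hadoop/hadoop-metadata/cache/dfs/name/current):
demo_dir=$(mktemp -d)
dd if=/dev/zero of="$demo_dir/edits.new" bs=1024 count=64 2>/dev/null
check_edits_size "$demo_dir" $((10 * 1024 * 1024))   # 10 GB threshold in KB
rm -rf "$demo_dir"
```

In a real deployment the function body would be the whole plugin, with the directory and threshold passed in from the Nagios service definition.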
On Saturday, December 17, 2011, Brock Noland <[EMAIL PROTECTED]> wrote:
> Since you're using CDH2, I am moving this to CDH-USER. You can subscribe;
> BCC'd common-user.
> On Sat, Dec 17, 2011 at 2:01 AM, Meng Mao <[EMAIL PROTECTED]> wrote:
>> Maybe this is a bad sign -- the edits.new was created before the master
>> node crashed, and is huge:
>> -bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current
>> total 41G
>> -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits
>> -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new
>> -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage
>> -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime
>> -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION
>> could this mean something was up with our SecondaryNameNode and rolling
>> edits file?
> Yes, it looks like a checkpoint never completed. It's a good idea to
> monitor the mtime on fsimage to ensure it never gets too old.
> Has a checkpoint completed since you restarted?
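Monitoring the mtime on fsimage, as suggested above, might look roughly like this. The `check_fsimage_age` name and the one-hour threshold are illustrative assumptions; the idea is simply that a fresh checkpoint rewrites fsimage, so a stale mtime means checkpoints have stopped.

```shell
#!/bin/sh
# Sketch: alert if fsimage has not been rewritten recently, i.e. no
# checkpoint has completed. Helper name and threshold are assumptions.
check_fsimage_age() {
    fsimage="$1"
    max_age_s="$2"
    now=$(date +%s)
    # GNU stat uses -c %Y for mtime; BSD stat uses -f %m. Try both.
    mtime=$(stat -c %Y "$fsimage" 2>/dev/null || stat -f %m "$fsimage")
    age=$((now - mtime))
    if [ "$age" -gt "$max_age_s" ]; then
        echo "CRITICAL: $fsimage last checkpointed ${age}s ago"
        return 2
    fi
    echo "OK: $fsimage checkpointed ${age}s ago"
    return 0
}

# Example against a scratch file standing in for the real fsimage:
demo_fsimage=$(mktemp)
check_fsimage_age "$demo_fsimage" 3600   # alert if older than 1 hour
rm -f "$demo_fsimage"
```

The same check catches the situation in this thread: a 39 GB edits.new next to an fsimage untouched since January would have fired months earlier.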