I work for a largish healthcare co. We finally started using (exploiting?)
Hadoop this year.
The day before our big, C-Suite-sponsored Hadoop launch celebration, we
realized that we could no longer ssh to our NN. Fail of all fails! Sorta.
Nothing was really wrong! NN was up, running, and servicing all requests.
Nagios and Ganglia were both satisfied with the replies from NN. And yet,
I still had no access. Boo! This was purely a Linux issue.
Being one to avoid "Game Day" failures, I decided to nuke the badly
behaving NN. PLEASE NOTE! I had the right opportunity: we had few
activities in Prod, I knew what was up with the NN, and I had high
confidence that HDFS on the NN was working normally.
So here is the slightly intoxicated order of events (I celebrated afterwards):
1. The NN always wrote to 2x local NN dirs plus 1 NFS dir.
2. I made sure all DNs were stopped.
3. The still-running NN was nuked -- AKA power switch, no mercy, pwr button.
4. I cp'd all the NFS NN metadata to my replacement NN's local dirs (not
   the 2nd NN or SNN).
5. I deleted the in_use.lock file, as required.
6. I started the NN services on the new NN.
7. I restarted the DN services on all DNs.
Somewhere along the line I altered the IP addr to ensure that my new PNN
was the one everyone was looking for. I specifically did NOT alter the
name or IP within any HDFS configs.
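For context, the reason the NFS copy existed at all: in Hadoop 1.x,
dfs.name.dir accepts a comma-separated list of directories, and the NN
mirrors its fsimage and edits into every one of them. A sketch of what the
relevant hdfs-site.xml entry might look like -- the paths here are made up,
not our actual layout:

```xml
<!-- hdfs-site.xml (Hadoop 1.x); paths are illustrative placeholders -->
<property>
  <name>dfs.name.dir</name>
  <!-- two local dirs plus one NFS mount; the NN writes its metadata
       (fsimage + edits) to each directory in this list -->
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```

Losing the box only costs you the two local copies; the NFS copy is what
makes the restore below possible.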
This all happened in < 30 mins. It was easy. It was testable.
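Steps 4 and 5 above boil down to a copy plus a lock-file cleanup. Here is a
minimal sketch, simulated with temp directories standing in for the NFS
mount and the replacement NN's local dfs.name.dir (on a real cluster you
would point these at the actual paths and then run the hadoop-daemon.sh
start commands instead of the final ls):

```shell
#!/bin/sh
# Temp dirs simulate the layout; swap in real paths on an actual cluster.
NFS_DIR=$(mktemp -d)    # stand-in for the NFS copy of the NN metadata
LOCAL_DIR=$(mktemp -d)  # stand-in for the new NN's local dfs.name.dir

# Fake metadata tree, as the dead NN would have left it on NFS
mkdir -p "$NFS_DIR/current"
touch "$NFS_DIR/current/fsimage" "$NFS_DIR/current/edits" "$NFS_DIR/in_use.lock"

# DNs assumed already stopped (e.g. bin/hadoop-daemon.sh stop datanode on each).

# Step 4: copy everything the NN wrote to NFS onto the replacement NN
cp -rp "$NFS_DIR/." "$LOCAL_DIR/"

# Step 5: remove the stale lock left behind by the dead NN; the new NN
# will refuse to use a storage dir that still contains in_use.lock
rm -f "$LOCAL_DIR/in_use.lock"

# Step 6, on a real cluster: bin/hadoop-daemon.sh start namenode
ls "$LOCAL_DIR/current"   # fsimage and edits should now be in place
```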
Full credit to the devs who built the PNN and HDFS... even at 1.2.x levels. ;)
We all want federated NN services, but until then, at least there is
recovery if you have NFS... and abnormal sanity. ;)
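One practical note if you lean on NFS for this: mount the NN's NFS dir so
that a hung NFS server can't hang the NN itself. The soft/timeo/retrans
combination below was a commonly recommended one in the Hadoop 1.x era,
not something from this particular outage, and the server name and path
are placeholders:

```
# /etc/fstab -- illustrative entry; filer:/vol/namenode is an assumption
filer:/vol/namenode  /mnt/nfs/dfs/nn  nfs  tcp,soft,intr,timeo=10,retrans=10  0 0
```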