-
Issue of FSImage, need help
mac fang 2011-06-28, 08:44
Hi, Team,
We have found that the FSImage often gets corrupted when we start or stop the Hadoop cluster. We suspect the write to the output stream: the NameNode may be killed while it is in saveNamespace, so the FSImage file is never completely written. I also see a previous.checkpoint folder; the logic of saveNamespace appears to be:
1. mv the current folder to the previous.checkpoint folder.
2. start to write the FSImage into the current folder.
I think that if the FSImage is corrupted, the NameNode can NOT be started, yet we do NOT get an EOFException. Instead we may hit an OutOfMemoryError first, because we read a wrong numBlocks and then instantiate Blocks [] blocks = new Blocks[numBlocks] (we have actually faced this issue).
Any suggestion to it?
thanks macf
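The failure described above (a garbage numBlocks triggering a huge allocation before any EOFException) can be illustrated with a bounds check on the count before allocating. This is only a minimal sketch and not Hadoop's actual loader: the class name BoundedCountReader, the method readBlockIds, the limit MAX_BLOCKS_PER_FILE, and the simplified "int count followed by long ids" layout are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class BoundedCountReader {
    // Hypothetical sanity limit; a real loader would derive this from its own format rules.
    static final int MAX_BLOCKS_PER_FILE = 1_000_000;

    // Reads an int count followed by that many long block ids (an assumed, simplified layout).
    static long[] readBlockIds(DataInputStream in) throws IOException {
        int numBlocks = in.readInt();
        // Reject implausible counts before allocating, so corruption surfaces as an
        // IOException instead of an OutOfMemoryError.
        if (numBlocks < 0 || numBlocks > MAX_BLOCKS_PER_FILE) {
            throw new IOException("Implausible numBlocks " + numBlocks
                    + "; the image is probably truncated or corrupt");
        }
        long[] ids = new long[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            ids[i] = in.readLong(); // throws EOFException if the stream ends early
        }
        return ids;
    }
}
```

With a check like this, a count of one billion read from a damaged file is reported as a corrupt image, and a plausible count on a truncated file still fails with the usual EOFException from readLong.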
-
Re: Issue of FSImage, need help
Denny Ye 2011-06-28, 09:11
*Root cause*: The FSImage is left in a wrong format when the user kills the HDFS process. Loading it may then read an invalid block count, perhaps 1 billion or more, so an OutOfMemoryError happens before any EOFException.
How can we verify the validity of the FSImage file?
--regards Denny Ye
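One general way to approach the validity question is to store a checksum with the image and to publish the file only after it is fully written. The sketch below is an illustration of that idea under stated assumptions, not a description of what Hadoop 0.21 does: the class ChecksummedImage, its save and load methods, the ".tmp" suffix, and the trailing 8-byte CRC32 layout are all hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksummedImage {
    // Writes payload plus a trailing CRC32 to a temp file, syncs it, then renames it into place.
    static void save(Path target, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(payload.length + Long.BYTES);
        buf.put(payload).putLong(crc.getValue());
        buf.flip();
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush data and metadata to disk before the rename
        }
        // Whether an existing target is replaced by an atomic move is platform-specific.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reads the file back and rejects it if the stored CRC32 does not match the payload.
    static byte[] load(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        if (all.length < Long.BYTES) {
            throw new IOException("Image too short to contain a checksum");
        }
        int n = all.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(all, 0, n);
        long stored = ByteBuffer.wrap(all, n, Long.BYTES).getLong();
        if (stored != crc.getValue()) {
            throw new IOException("Checksum mismatch; image is truncated or corrupt");
        }
        return Arrays.copyOf(all, n);
    }
}
```

Because the checksum is verified before any field is parsed, a file cut short by a killed process is rejected up front, and because the rename happens only after the sync, a half-written image should never appear under the final name.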
-
Re: Issue of FSImage, need help
Todd Lipcon 2011-06-28, 15:03
Hi Denny,
Which version of Hadoop are you using, and when are you killing the NameNode? Are you using a Unix signal (e.g. kill -9) or cutting power to the whole machine?
Thanks -Todd
-- Todd Lipcon Software Engineer, Cloudera
-
Re: Issue of FSImage, need help
mac fang 2011-06-29, 01:11
Hi, Todd,
We use version 0.21. I think we used 'kill -9'. The kill most likely happened during startup or during a checkpoint.
regards macf
-
Re: Issue of FSImage, need help
mac fang 2011-07-04, 05:07
Guys,
Any clues as to why the image could get corrupted?
regards macf