Name Node Corruption When Shutdown Too Soon
Allen, Jonathan 2010-02-07, 16:45
I've come across a name node bug and just wanted to check if it's a known issue before I formally raise it (I've had a quick look through the database but couldn't see anything obvious).
If the name node is shut down before it has finished reading the edit log, the edit log is removed without the image file being updated. The name node then reverts to its previously saved state (out of sync with the data nodes) and the most recent edits are lost.
Does anybody recognise this as a known issue or should I raise it?
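To make the reported symptom concrete, here is a toy model of it. This is only a sketch and not Hadoop code: the namespace is assumed to be a plain set of paths, the edit log a list of (operation, path) pairs, and the function name `load_namespace` is hypothetical.

```python
# Toy model only, not Hadoop code. The image is a saved set of paths and
# the edit log is a list of (operation, path) pairs applied on top of it.

def load_namespace(image, edits):
    """Rebuild the namespace by replaying the edit log on top of the image."""
    namespace = set(image)
    for op, path in edits:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace

image = {"/user"}
edits = [("create", "/user/a"), ("create", "/user/b"), ("delete", "/user/a")]

# Normal restart: the image plus every logged edit.
expected = load_namespace(image, edits)

# Sequence described in the report: the edit log has been emptied, but the
# image on disk was never rewritten with the replayed state.
edits_after_bad_shutdown = []
recovered = load_namespace(image, edits_after_bad_shutdown)

# Everything that only existed in the edit log is gone.
lost = expected - recovered
print(sorted(expected))
print(sorted(recovered))
print(sorted(lost))
```

In this model the edits are only safe to discard once an image containing them has been written, which is the invariant the report says was broken.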
Thanks, Jonathan Allen UKGP, NS&R, Defence and Security HP Enterprise Services Telephone +44 1682 292101 Email [EMAIL PROTECTED] Street address, Unit 29, Alexandra Way, Ashchurch Business Park, Tewkesbury, Gloucestershire. GL20 8NB
Hewlett-Packard Limited registered Office: Cain Road, Bracknell, Berks RG12 1HN Registered No: 690597 England The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error, you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated you should consider this message and attachments as "HP CONFIDENTIAL".
Re: Name Node Corruption When Shutdown Too Soon
Konstantin Shvachko 2010-02-08, 21:23
Hi Jonathan,
Thank you for raising the issue. We will need more information about your configuration files.
It sounds like a problem noted by Todd in HDFS-909. If the edits directory precedes the image directory in the configuration, the edits will be emptied before the image is saved.
Either way, it is worth filing a JIRA for this and attaching logs, config files, and whatever else you think may help reproduce the problem.
Thanks, --Konstantin Shvachko
Re: Name Node Corruption When Shutdown Too Soon
Todd Lipcon 2010-02-09, 00:45
Hey Jonathan,
As Konstantin mentioned, I've been looking into a couple issues that could be related. At first glance it doesn't sound like you've run into quite the same thing.
What version did you see this on? The steps to reproduce are something like:
1) Start a NN
2) Perform a bunch of edits so there is a large edit log
3) kill -9 the NN
4) start the NN again
5) while it is in the middle of replaying edits, kill -9 it again
6) start the NN, and lose all the previous edits?
Or did I misunderstand what happened? If that sounds right, I'll give it a go and see if I can reproduce.
Thanks -Todd
Re: Name Node Corruption When Shutdown Too Soon
Todd Lipcon 2010-02-09, 01:10
Hi Jonathan,
Another question: how have you configured dfs.name.dir? Do you have several directories configured?
Thanks -Todd
RE: Name Node Corruption When Shutdown Too Soon
Allen, Jonathan 2010-02-09, 20:05
Todd,
Unfortunately my test system is air gapped away from the internet so I haven't been able to transfer my test case across yet, but the basic steps are as follows:
1) start-dfs (also shut down the secondary to make sure that it doesn't checkpoint away the edit log)
2) create lots of small files so that there is a large edit log (I created about 4,500 files, resulting in an edit log of just over 1MB)
3) stop-dfs
4) start-dfs
5) wait for the name node to start reading the edit log, but not long enough for it to finish reading it (I waited for a couple of seconds)
6) stop-dfs
7) start-dfs
8) listing the hdfs directory now shows it in the same state as at step (1) rather than the correct state as at step (3)
This was running with the Yahoo distro of 0.20.1.
The dfs.name.dir is configured to use directories on 2 local drives and 1 NFS mounted drive.
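The steps above could be written down as a command plan along the following lines. This is a rough sketch, not the original test case: it only builds and prints the commands and runs nothing. The script names (`start-dfs.sh`, `stop-dfs.sh`, `hadoop-daemon.sh`), the `/repro` directory, the use of empty files via `hadoop fs -touchz`, and the 2-second pause are all assumptions, and `build_repro_plan` is a hypothetical helper.

```python
# Sketch of the reproduction steps as a command plan. Nothing is executed;
# script names, the /repro path and the pause length are assumptions.

def build_repro_plan(num_files=4500, pause_seconds=2, base_dir="/repro"):
    """Return the steps as a list of argv lists, in the order described."""
    plan = [
        ["start-dfs.sh"],                                    # step 1
        ["hadoop-daemon.sh", "stop", "secondarynamenode"],   # step 1: avoid checkpoints
        ["hadoop", "fs", "-mkdir", base_dir],
    ]
    # Step 2: many small files so that the edit log grows.
    for i in range(num_files):
        plan.append(["hadoop", "fs", "-touchz", f"{base_dir}/file-{i:05d}"])
    plan += [
        ["stop-dfs.sh"],                     # step 3
        ["start-dfs.sh"],                    # step 4
        ["sleep", str(pause_seconds)],       # step 5: stop while edits are replaying
        ["stop-dfs.sh"],                     # step 6
        ["start-dfs.sh"],                    # step 7
        ["hadoop", "fs", "-ls", base_dir],   # step 8: compare with state at step 3
    ]
    return plan

# Print a shortened plan (3 files instead of 4,500) for review.
plan = build_repro_plan(num_files=3)
for argv in plan:
    print(" ".join(argv))
```

The pause in step 5 is the sensitive part: it has to be long enough for the name node to begin replaying edits and short enough that it has not finished, so the right value will depend on the size of the edit log and the machine.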
Thanks, Jonathan
Re: Name Node Corruption When Shutdown Too Soon
Todd Lipcon 2010-02-09, 20:11
Thanks Jonathan,
I tried to reproduce this yesterday using a single dfs.name.dir, but I'll give it a go again with multiple.
Will let you know what I turn up.
-Todd
Re: Name Node Corruption When Shutdown Too Soon
Todd Lipcon 2010-02-10, 01:27
Hi Jonathan,
I've reproduced your issue.
I'll comment on HDFS-955 as I believe this is another manifestation of the same issue.
-Todd
Re: Name Node Corruption When Shutdown Too Soon
Konstantin Shvachko 2010-02-10, 00:41
What is the value of dfs.name.edits.dir? Is it the default, which would be the same as dfs.name.dir, or is it different?
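One way to answer this question from a config file is sketched below. It assumes the usual `<property><name>/<value>` layout of a Hadoop site file; the sample paths are made up, the helper names are hypothetical, and treating a missing dfs.name.dir as an empty list is a simplification made only for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up sample: three name directories, no explicit dfs.name.edits.dir.
SAMPLE_CONF = """<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/dfs/name</value>
  </property>
</configuration>"""

def read_properties(xml_text):
    """Collect name/value pairs from a Hadoop-style configuration document."""
    props = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props

def split_dirs(value):
    """Split a comma-separated directory list, dropping empty entries."""
    return [d.strip() for d in value.split(",") if d.strip()]

def effective_dirs(props):
    """Return (image_dirs, edits_dirs); edits fall back to dfs.name.dir."""
    image_dirs = split_dirs(props.get("dfs.name.dir", ""))
    edits_value = props.get("dfs.name.edits.dir")
    edits_dirs = split_dirs(edits_value) if edits_value else list(image_dirs)
    return image_dirs, edits_dirs

image_dirs, edits_dirs = effective_dirs(read_properties(SAMPLE_CONF))
print(image_dirs)
print(edits_dirs)
print(image_dirs == edits_dirs)
```

If the two lists differ, the order and placement of the edits directories relative to the image directories is the detail worth including in the JIRA.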
Thanks, --Konstantin