Re: data loss after cluster wide power loss
Azuryy Yu 2013-07-02, 08:45
Hi Uma,

I think there is minimal performance degradation if you set
dfs.datanode.synconclose to true.
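
For reference, the property goes in hdfs-site.xml on the datanodes (a
minimal config sketch; the property name is real, taken from this
thread):

  <property>
    <name>dfs.datanode.synconclose</name>
    <value>true</value>
  </property>

With it enabled, the datanode syncs block files to disk when they are
closed, which is where the extra cost comes from.
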
On Tue, Jul 2, 2013 at 3:31 PM, Uma Maheswara Rao G <[EMAIL PROTECTED]> wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced a similar issue
> some time back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing
> happened in our case, i.e., after restart, blocks were moved back to
> the blocksBeingWritten directory even though they were finalized.
> After restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <[EMAIL PROTECTED]>
> wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not completely read all
> > the details you had sent earlier (on my phone). As you said, this is
> > not related to the data loss around the HBase log and hsync. I think
> > you are right; the rename operation itself might not have hit the
> > disk. I think we should either ensure the metadata operation is
> > synced on the datanode or handle the block being reported as
> > blockBeingWritten. Let me spend some time debugging this issue.
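> >
> > (To make the first option concrete, here is a minimal sketch only -
> > not what the datanode does today - of a rename followed by an fsync
> > of the destination directory, so that the rename itself survives a
> > power loss. Opening a directory for read and calling force() is
> > known to work on Linux with OpenJDK; other platforms may throw
> > IOException.
> >
> >   import java.io.File;
> >   import java.io.IOException;
> >   import java.nio.channels.FileChannel;
> >   import java.nio.file.StandardOpenOption;
> >
> >   static void renameDurably(File src, File dst) throws IOException {
> >     if (!src.renameTo(dst)) {
> >       throw new IOException("rename failed: " + src + " -> " + dst);
> >     }
> >     // fsync the parent dir so the new directory entry is on disk
> >     try (FileChannel dir = FileChannel.open(
> >         dst.getParentFile().toPath(), StandardOpenOption.READ)) {
> >       dir.force(true);
> >     }
> >   }
> >
> > The method name is made up for illustration; a real fix would live
> > in the datanode's finalize path.)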
> >
> > One surprising thing is that all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls
> >> FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls
> >> FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
> >>
> >> This looks to me like a synchronous chain from the DFS client to the
> >> move of the block files from blocksBeingWritten to the current dir,
> >> so that once the file is closed the block files should be in the
> >> proper directory - even if the contents of the file are still in the
> >> OS buffer rather than synced to disk.  It's only after this moving
> >> of blocks that NameNode.completeFile is called.  There are several
> >> conditions and loops in there, so without a greater understanding of
> >> the code I'm not certain this chain is fully reliable in all cases.
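> >>
> >> (As a plain-Java illustration of that last distinction - nothing
> >> HDFS-specific, just a sketch with a made-up file name and an assumed
> >> byte[] data:
> >>
> >>   try (FileOutputStream out = new FileOutputStream("block.dat")) {
> >>     out.write(data);     // data lands in the OS page cache
> >>     out.getFD().sync();  // fsync: forces the bytes to the device
> >>   }
> >>
> >> Without the sync() call, close() returns with the data possibly
> >> still in the page cache - exactly the window a power loss hits.)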
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the file close logic such that the block files
> >> are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of HDFS-744, the hsync API would allow a client to make
> >>> sure that at any point in time its writes so far have hit the disk.
> >>> For example, for HBase it could apply an fsync after adding some
> >>> edits to its WAL to