Azuryy Yu 2013-07-02, 08:45
How did you lose power to the entire cluster?
I realize that this question goes beyond HBase and is really an Ops question. Do you have redundant power sources and redundant power supplies to the racks and machines in the cluster?
On Jul 2, 2013, at 7:42 AM, Dave Latham <[EMAIL PROTECTED]> wrote:
> Hi Uma,
> Thanks for the pointer. Your case sounds very similar. The main differences that I see are that in my case it happened on all 3 replicas and the power failure occurred merely seconds after the blocks were finalized. So I guess the question is whether HDFS can do anything to better recover from such situations. I'm also curious whether ext4 would be less susceptible than ext3.
> I will definitely look at enabling dfs.datanode.synconclose once we upgrade to a version of HDFS that has it. I would love to see some performance numbers if anyone has run them. It also appears that HBase is considering enabling it by default (cf. comments on HBASE-5954).
> On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G <[EMAIL PROTECTED]> wrote:
> Hi Dave,
> Looks like your analysis is correct. I faced a similar issue some time back.
> See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem edits can be lost. A similar thing happened in our case, i.e., after restart blocks were moved back to the BeingWritten directory even though they were finalized.
> After restart they were marked as corrupt. You could set dfs.datanode.synconclose to true to avoid this sort of thing, but that will degrade performance.
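For readers following along, the setting mentioned above would typically go in hdfs-site.xml on each datanode. This is a minimal fragment, assuming an HDFS version that includes the property (it defaults to false); the performance cost noted above still applies and should be measured on your own workload.

```xml
<!-- hdfs-site.xml on each datanode (assumes an HDFS release that supports this property) -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```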
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Dave Latham
> Sent: 01 July 2013 16:08
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: data loss after cluster wide power loss
> Much appreciated, Suresh. Let me know if I can provide any more information or if you'd like me to open a JIRA.
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <[EMAIL PROTECTED]> wrote:
> > Dave,
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to the data loss involving the HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure the metadata operation is synced on the
> > datanode or handle the block being reported as blockBeingWritten. Let me
> > spend some time to debug this issue.
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> > Regards,
> > Suresh
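To illustrate the idea of syncing the metadata operation discussed above, here is a minimal Java sketch of a rename that is followed by an fsync of the directories involved. It is not the HDFS datanode code: the class and method names (DurableRename, fsync, finalizeBlock) are hypothetical, and it assumes a platform such as Linux where a directory can be opened read-only and forced to disk. On platforms where that fails, the directory sync is treated as best effort.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of HDFS: shows rename + directory fsync.
public class DurableRename {

    /** Forces a file's contents, or a directory's entries, to stable storage. */
    public static void fsync(Path path, boolean isDir) throws IOException {
        StandardOpenOption mode = isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE;
        try (FileChannel ch = FileChannel.open(path, mode)) {
            ch.force(true);
        } catch (IOException e) {
            if (!isDir) {
                throw e;
            }
            // Some platforms cannot fsync a directory; treat it as best effort.
        }
    }

    /** Moves a finished block file into destDir and syncs the rename itself. */
    public static Path finalizeBlock(Path tmpFile, Path destDir) throws IOException {
        Path dest = destDir.resolve(tmpFile.getFileName());
        fsync(tmpFile, false);                                  // data first
        Files.move(tmpFile, dest, StandardCopyOption.ATOMIC_MOVE);
        fsync(destDir, true);                                   // new directory entry
        fsync(tmpFile.getParent(), true);                       // removed directory entry
        return dest;
    }
}
```

Without the two directory syncs, the rename may still be sitting in the OS cache when power is lost, which matches the symptom of finalized blocks reappearing under the being-written directory after restart.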
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls DFSOutputStream.flushInternal
> >>     - enqueues current packet
> >>     - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >>    FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >>    FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>     - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
> >>
> >> This looks to me, as I would expect, like a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir,
> >> so that once the file is closed the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk. It's only after this moving of blocks that
> >> NameNode.complete file is called. There are several conditions and