Re: data loss after cluster wide power loss
Dave Latham 2013-07-02, 01:03
(Removing hbase list and adding hdfs-dev list as this is pretty internal.)
Reading through the code a bit:
- sets currentPacket.lastPacketInBlock = true
- then calls
- enqueues current packet
- waits for ack
- if (lastPacketInBlock && !receiver.finalized) calls
- renames block from "blocksBeingWritten" tmp dir to "current" dest dir
This looks to me like the synchronous chain I would expect, from a DFS client
through to moving the block file from blocksBeingWritten to the current dir, so
that once the file is closed the block files are in the proper directory,
even if the contents of the file are still in the OS buffer rather than
synced to disk. It's only after this moving of blocks that
NameNode.complete file is called. There are several conditions and loops
in there, so without a greater understanding of the code I'm not certain
that this chain is fully reliable in all cases.
Could it be the case that the rename operation itself is not synced and
that ext3 lost the fact that the block files were moved?
Or is there a bug in the close file logic that for some reason the block
files are not always moved into place when a file is closed?
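To make the first question concrete, here is a minimal sketch of what a
rename that survives power loss would need to do on a local file system:
sync the file, rename it, then sync the directories involved. This is plain
Java, not the DataNode code. The class and method names are hypothetical, and
it assumes a platform such as Linux where a directory can be opened read-only
and fsynced through FileChannel (this fails on some platforms, e.g. Windows).

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not HDFS code: move a block file so the move survives power loss. */
final class DurableRename {

    /** Ask the OS to flush a file's, or a directory's, data and metadata to the device. */
    static void fsync(Path path) throws IOException {
        // Directories can only be opened for reading; regular files are opened for writing.
        StandardOpenOption mode =
                Files.isDirectory(path) ? StandardOpenOption.READ : StandardOpenOption.WRITE;
        try (FileChannel ch = FileChannel.open(path, mode)) {
            ch.force(true); // true = flush metadata as well as contents
        }
    }

    /**
     * Moves tmpFile to destFile. Both paths are assumed to be absolute and on the
     * same file system, so that getParent() is non-null and the move can be atomic.
     */
    static void finalizeBlock(Path tmpFile, Path destFile) throws IOException {
        fsync(tmpFile);                                                // 1. block contents reach the disk
        Files.move(tmpFile, destFile, StandardCopyOption.ATOMIC_MOVE); // 2. rename into place
        fsync(destFile.getParent());                                   // 3. persist the new directory entry
        fsync(tmpFile.getParent());                                    // 4. persist removal of the old entry
    }
}
```

If steps 3 and 4 are skipped, the rename may exist only in the OS cache when
power is lost, which is the situation the first question asks about.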
Thanks for your patience,
On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
> Thanks for the response, Suresh.
> I'm not sure that I understand the details properly. From my reading of
> HDFS-744 the hsync API would allow a client to make sure that at any point
> in time its writes so far hit the disk. For example, for HBase it could
> apply a fsync after adding some edits to its WAL to ensure those edits are
> fully durable for a file which is still open.
> However, in this case the dfs file was closed and even renamed. Is it the
> case that even after a dfs file is closed and renamed that the data blocks
> would still not be synced and would still be stored by the datanode in
> "blocksBeingWritten" rather than in "current"? If that is the case, would it
> be better for the NameNode not to reject replicas that are in
> blocksBeingWritten, especially if it doesn't have any other replicas?
> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <[EMAIL PROTECTED]> wrote:
>> Yes this is a known issue.
>> The HDFS part of this was addressed in
>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
>> available in 1.x release. I think HBase does not use this API yet.
>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes. On Saturday the
>> > data center we were in had a total power failure and the cluster went down
>> > hard. When we brought it back up, HDFS reported 4 files as CORRUPT. We
>> > recovered the data in question from our secondary datacenter, but I'm
>> > trying to understand what happened and whether this is a bug in HDFS that
>> > should be fixed.
>> > From what I can tell the file was created and closed by the dfs client
>> > (hbase). Then HBase renamed it into a new directory and deleted some
>> > files containing the same data. Then the cluster lost power. After the
>> > cluster was restarted, the datanodes reported into the namenode but the
>> > blocks for this file appeared as "blocks being written" - the namenode
>> > rejected them and the datanodes deleted the blocks. At this point there
>> > were no replicas for the blocks and the files were marked CORRUPT. The
>> > underlying file systems are ext3. Some questions that I would love to get
>> > answers for if anyone with deeper understanding of HDFS can chime in:
>> > - Is this a known scenario where data loss is expected? (I found
>> > HDFS-1539 but that seems different)