Re: data loss after cluster wide power loss
Kihwal Lee 2013-07-03, 18:57
For the ext3 bug Colin mentioned, see
https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in
2.6.32 and backported to RHEL 5.4 (or CentOS). It has more to do with
file data and affects the NN more. Since the NN preallocates blocks for
edits, almost all data writes are done without modifying the file length.
If fsync() loses the race to pdflush, the edit log content won't be
flushed. I am glad this bug was fixed.
Regardless of this bug, renames will be journaled but not necessarily
committed before the call returns. If power fails before the transaction
gets committed, the transaction can be lost. Journaling makes sure the
file system comes back in a consistent state, but not all changes will be
there. To make a rename durable, the directory can be opened and
fsync()'ed. I think both source and destination are recorded in one
transaction, so fsyncing the destination directory should be enough.
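As a rough illustration of the rename-then-fsync idea above, here is a
minimal sketch in Python. It is not the datanode code: durable_rename is a
hypothetical helper name, and it assumes a POSIX system such as Linux where
a directory can be opened read-only and passed to fsync().

```python
import os


def durable_rename(src, dst):
    """Rename src to dst, then fsync the destination directory.

    Hypothetical helper for illustration only. Assumes a POSIX system
    (e.g. Linux) where a directory can be opened read-only and fsync()'ed.
    """
    os.rename(src, dst)

    # Open the directory that now holds the new name and fsync it, so the
    # journal transaction recording the rename is committed before we return.
    dir_path = os.path.dirname(os.path.abspath(dst))
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only covers the directory entry. The file's own data would still need
its own fsync() before the rename, and none of it helps if the drive ignores
the cache flush.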
When RWR blocks are reported, the NN can probably accept and finalize them
if the block size is as expected.
On 7/3/13 10:12 AM, "Colin McCabe" <[EMAIL PROTECTED]> wrote:
>On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <[EMAIL PROTECTED]> wrote:
>> Thanks for the detailed email. Sorry I did not read all the details you
>> sent earlier completely (on my phone). As you said, this is not data
>> loss related to the HBase log and hsync. I think you are right; the
>> operation itself might not have hit the disk. I think we should either
>> ensure the metadata operation is synced on the datanode or handle it being
>> reported as blockBeingWritten. Let me spend some time to debug this
>In theory, ext3 is journaled, so all metadata operations should be
>durable in the case of a power outage. Only data operations should be
>at risk of being lost. It is the same for ext4. (Assuming
>you are not using nonstandard mount options.)
>In practice, it is possible that your hard disks didn't actually
>persist the data that they said they did. Rumor has it that some
>drives ignore the SATA FLUSH CACHE command in some cases, since it
>makes them look bad in benchmarks. In that case, there is nothing the
>filesystem or any other software can do.
>There was also a bug in older Linux kernels where the kernel would not
>actually send FLUSH CACHE. Since the poster is using ext3 and
>hadoop-1, it's possible he's using an antique kernel as well. I
>know for sure this affected LVM-- it used to ignore barriers until
>In Ceph, we used to recommend disabling the hard drive write cache if
>your kernel was older than 2.6.33. You can read the recommendation
>for yourself here:
> This will have an impact on performance, however.
>An uninterruptible power supply is not a bad idea.
>I am curious:
>what kernel version are you using?
>are you using LVM?
>> One surprising thing is, all the replicas were reported as
>> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
>>> (Removing hbase list and adding hdfs-dev list as this is pretty
>>> Reading through the code a bit:
>>> FSDataOutputStream.close calls
>>> DFSOutputStream.close calls
>>> - sets currentPacket.lastPacketInBlock = true
>>> - then calls
>>> - enqueues current packet
>>> - waits for ack
>>> - if (lastPacketInBlock && !receiver.finalized) calls
>>> FSDataset.finalizeBlock calls
>>> FSDataset.finalizeBlockInternal calls
>>> FSVolume.addBlock calls
>>> FSDir.addBlock calls
>>> - renames block from "blocksBeingWritten" tmp dir to "current" dest
>>> This looks to me as I would expect a synchronous chain from a DFS
>>> to moving the file from blocksBeingWritten to the current dir so that