Re: data loss after cluster wide power loss
Dave Latham 2013-07-06, 21:26
Thanks for the detailed information, Michael, Colin, Suresh, Kihwal.
It looks like we're on CentOS 5.7 (kernel 2.6.18-274.el5) so if the fix was
included in 5.4 then it sounds like we should have it. I don't believe
we're using LVM.
It sounds like HDFS could improve handling of this scenario by fsyncing
directories after finalizing blocks or perhaps accepting blocksBeingWritten
that are reported later. At least as an option for clients, like HBase,
that need to rely on durability.
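A minimal sketch of the directory-fsync idea described in Kihwal's message quoted below, written in Python for brevity (HDFS itself is Java). The function name durable_rename is hypothetical and this is not HDFS code. It assumes a POSIX system such as Linux, and that fsyncing only the destination directory is sufficient, which Kihwal suggests but does not confirm:

```python
import os


def durable_rename(src, dst):
    """Rename src to dst, then fsync the destination directory.

    Hypothetical helper: the rename alone is journaled but may not be
    committed to disk when os.rename() returns. Opening the directory
    and calling fsync() on it asks the kernel to commit that change.
    """
    os.rename(src, dst)
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # A directory can be opened read-only and fsynced on Linux.
    dir_fd = os.open(dst_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only helps if the file data was itself fsynced before the rename, and if the kernel and the drive actually honor cache flushes, which is the concern Colin raises in the quoted text.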
On Wed, Jul 3, 2013 at 11:57 AM, Kihwal Lee <[EMAIL PROTECTED]> wrote:
> For the ext3 bug Colin mentioned, see
> https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in
> 2.6.32 and backported in RHEL 5.4 (or CENTOS). This has more to do with
> file data and affects NN more. Since NN preallocates blocks for edits,
> almost all data writes are done without modifying the file length. If
> fsync() loses race to pdflush, the edit log content won't be flushed. I am
> glad this bug was fixed.
> Regardless of this bug, renames will be journaled but not necessarily be
> committed before returning. If power fails before the transaction gets
> committed, the transaction can be lost. Journaling will make sure the file
> system will come back in a consistent state, but not all changes will be
> there. To make a rename durable, the directory can be opened and
> fsync()'ed. I think both source and destination are recorded in one
> transaction, so fsyncing on destination directory should be enough.
> When RWR blocks are reported, NN can probably accept and finalize it if
> the block size is as expected.
> On 7/3/13 10:12 AM, "Colin McCabe" <[EMAIL PROTECTED]> wrote:
> >On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <[EMAIL PROTECTED]>
> >> Dave,
> >> Thanks for the detailed email. Sorry I did not read all the details you
> >> sent earlier completely (on my phone). As you said, this is not related
> >> to data loss related to HBase log and hsync. I think you are right; the
> >> operation itself might not have hit the disk. I think we should either
> >> ensure metadata operation is synced on the datanode or handle it being
> >> reported as blockBeingWritten. Let me spend some time to debug this
> >In theory, ext3 is journaled, so all metadata operations should be
> >durable in the case of a power outage. It is only data operations
> >that should be possible to lose. It is the same for ext4. (Assuming
> >you are not using nonstandard mount options.)
> >In practice, it is possible that your hard disks didn't actually
> >persist the data that they said they did. Rumor has it that some
> >drives ignore the SATA FLUSH CACHE command in some cases, since it
> >makes them look bad in benchmarks. In that case, there is nothing the
> >filesystem or any other software can do.
> >There was also a bug in older Linux kernels where the kernel would not
> >actually send FLUSH CACHE. Since the poster is using ext3 and
> >hadoop-1, it's possible he's using an antique kernel as well. I
> >know for sure this affected LVM-- it used to ignore barriers until
> >fairly recently.
> >In Ceph, we used to recommend disabling the hard drive write cache if
> >your kernel was older than 2.6.33. You can read the recommendation
> >for yourself here:
> > This will have an impact on performance, however.
> >An uninterruptible power supply is not a bad idea.
> >I am curious:
> >what kernel version are you using?
> >are you using LVM?
> >> One surprising thing is, all the replicas were reported as
> >> blockBeingWritten.
> >> Regards,
> >> Suresh
> >> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <[EMAIL PROTECTED]>
> >>> (Removing hbase list and adding hdfs-dev list as this is pretty
> >>> stuff).