Re: data loss after cluster wide power loss
Colin McCabe 2013-07-03, 15:12
On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <[EMAIL PROTECTED]> wrote:
> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details you had
> sent earlier completely (on my phone). As you said, this is not related to
> the data loss issue with the HBase log and hsync. I think you are right; the
> rename operation itself might not have hit the disk. I think we should either
> ensure the metadata operation is synced on the datanode or handle it being
> reported as blockBeingWritten. Let me spend some time to debug this issue.

In theory, ext3 is journaled, so all metadata operations should be
durable in the case of a power outage.  It is only data operations
that should be possible to lose.  It is the same for ext4.  (Assuming
you are not using nonstandard mount options.)
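
(That said, if one did want the application to force the rename out to disk
itself -- roughly what Suresh suggests above -- a sketch in modern Java NIO
might look like the following.  This is purely illustrative: hadoop-1
predates this API, and the class and method names here are made up rather
than taken from the DataNode.)

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    public class DurableRename {
        // Rename src into dstDir, then fsync dstDir so the new directory
        // entry itself reaches stable storage instead of sitting in the
        // OS page cache.
        public static void renameDurably(Path src, Path dstDir) throws IOException {
            Path dst = dstDir.resolve(src.getFileName());
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);

            // Opening a directory read-only and calling force() works on
            // Linux (ext3/ext4); it may throw on other platforms.
            try (FileChannel dir = FileChannel.open(dstDir, StandardOpenOption.READ)) {
                dir.force(true);
            }
        }
    }

Forcing the directory pushes the journal transaction containing the rename
out to disk, so a later power loss cannot roll the rename back -- assuming,
of course, that the drive actually honors cache flushes, which is the next
question.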

In practice, it is possible that your hard disks didn't actually
persist the data that they said they did.  Rumor has it that some
drives ignore the SATA FLUSH CACHE command in some cases, since it
makes them look bad in benchmarks.  In that case, there is nothing the
filesystem or any other software can do.

There was also a bug in older Linux kernels where the kernel would not
actually send FLUSH CACHE.  Since the poster is using ext3 and
hadoop-1, it's possible he's using an antique kernel as well.  I know
for sure this affected LVM -- it used to ignore barriers until fairly
recently.

In Ceph, we used to recommend disabling the hard drive write cache if
your kernel was older than 2.6.33.  You can read the recommendation
for yourself here:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
This will have an impact on performance, however.

An uninterruptible power supply is not a bad idea.

I am curious:
what kernel version are you using?
are you using LVM?

Colin
>
> One surprising thing is, all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
>>
>> (Removing hbase list and adding hdfs-dev list as this is pretty internal
>> stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>>
>> This looks to me like what I would expect: a synchronous chain from the DFS
>> client to the move of the file from blocksBeingWritten to the current dir, so
>> that once the file is closed the block files are in the proper directory -
>> even if the contents of the file are still in the OS buffer rather than
>> synced to disk.  It's only after this moving of blocks that the NameNode's
>> complete-file call is made.  There are several conditions and loops in there,
>> though, so without a greater understanding of the code I'm not certain this
>> chain is fully reliable in all cases.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close-file logic such that for some reason the
>> block files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <[EMAIL PROTECTED]> wrote:
>>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my reading of
>>> HDFS-744, the hsync API would allow a client to make sure that, at any point
>>> in time, its writes so far have hit the disk.  For example, for HBase it could
>>> apply a fsync after adding some edits to its WAL to ensure those edits are