I have a small cluster (3 machines, with 6 data disks/box) running
HBase on top of Hadoop HDFS (1.0.2). Generally this works as expected,
and quite well. But the equipment I have is old and some of the drives
are starting to fail. That would not normally be a problem, except
that whenever a new drive fails I lose data, despite HDFS assurances.

When a disk is dying, we add its machine to the dfs exclude file to
decommission it. This causes the system as a whole to try to copy all
the unreplicated blocks off. There are usually more blocks in this
category than I'd expect, since we're using the default replication
target of 3. After a while, when the decommission process reports that
there are 0 remaining blocks with no live copies, I can remove the bad
path from the list of data directories (in hdfs-site.xml) and add the
machine back to the cluster.

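For concreteness, the decommission step looks roughly like the sketch below. It is only an illustration: the helper names are made up, the exclude file path and hostname in the usage note are placeholders, and it assumes the NameNode's dfs.hosts.exclude setting already points at that file. The only Hadoop command it issues is `hadoop dfsadmin -refreshNodes`.

```python
import subprocess
from pathlib import Path


def add_to_exclude_file(exclude_file, hostname):
    """Append hostname to the exclude file unless it is already listed.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(exclude_file)
    hosts = path.read_text().split() if path.exists() else []
    if hostname in hosts:
        return False
    hosts.append(hostname)
    path.write_text("\n".join(hosts) + "\n")
    return True


def refresh_nodes_command():
    """Command that asks the NameNode to re-read its include/exclude files."""
    return ["hadoop", "dfsadmin", "-refreshNodes"]


def decommission(exclude_file, hostname, run=subprocess.check_call):
    """Exclude the host, then tell the NameNode to pick up the change.

    `run` is injectable so the function can be exercised without a cluster.
    """
    added = add_to_exclude_file(exclude_file, hostname)
    run(refresh_nodes_command())
    return added
```

Usage would be something like `decommission("/path/to/dfs.exclude", "node2.example.com")`, with both arguments replaced by whatever the cluster actually uses.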
At this point, I would expect HDFS to report a healthy filesystem. But
when I run hadoop fsck, it reports a corrupt block. When I try to
manually read the file associated with that block (on the disk that
has been removed), I get I/O errors. This in and of itself is not
surprising: obviously HDFS couldn't replicate a block during the
decommission process if it couldn't read it. But how did I even get
into the position of having a block with only a single replica?
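One way to spot blocks that are short of replicas before a disk dies would be to pull the replication counters out of the fsck summary on a schedule. The sketch below is a rough illustration, not a tested tool: it assumes the summary contains lines of the form "Under-replicated blocks: N", "Corrupt blocks: N" and "Missing replicas: N", which should be checked against the output of the Hadoop version in use, and the function names are made up.

```python
import re
import subprocess

# Assumed summary labels; verify them against the real fsck output.
COUNTER = re.compile(
    r"^[ \t]*(Under-replicated blocks|Corrupt blocks|Missing replicas):[ \t]+(\d+)",
    re.MULTILINE,
)


def parse_fsck_summary(text):
    """Return a dict mapping each recognised summary label to its count."""
    return {label: int(count) for label, count in COUNTER.findall(text)}


def run_fsck(path="/"):
    """Run `hadoop fsck` on path and parse the counters from its output."""
    result = subprocess.run(
        ["hadoop", "fsck", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_fsck_summary(result.stdout)
```

Any non-zero "Under-replicated blocks" or "Corrupt blocks" count from `run_fsck("/")` would be a reason to look closer before pulling a disk.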

How can I set up my cluster to replicate blocks sooner, so that I can
avoid any data loss in the case of bad disks? Are there any settings I
can tweak, or any processes I should be running regularly?
Thanks in advance,