Re: HDFS drive, partition best practice
Patrick Angeles 2011-02-08, 19:53
On Mon, Feb 7, 2011 at 2:06 PM, Jonathan Disher <[EMAIL PROTECTED]> wrote:
> I've never seen an implementation of concat volumes that tolerates a disk
> failure, just like RAID0.
> Currently I have a 48 node cluster using Dell R710s with 12 disks - two
> 250GB SATA drives in RAID1 for OS, and ten 1TB SATA disks as a JBOD (mounted
> on /data/0 through /data/9) and listed separately in hdfs-site.xml. It
> works... mostly. The big issue you will encounter is losing a disk - the
> DataNode process will crash, and if you comment out the affected drive, when
> you replace it you will have 9 disks full to N% and one empty disk. The DFS
> balancer cannot fix this - usually when I have data nodes down more than an
> hour, I format all drives in the box and rebalance.
You can manually move blocks around between volumes in a single node (while
the DN is not running). It would be great someday if the DN managed this.
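To make the manual move concrete, here is a minimal planning sketch in Python. It only decides which blocks to shift from the fullest volume to the emptiest one; it does not touch any disk. The block names (blk_1 and so on), the sizes, and the function name plan_moves are made up for illustration, and the greedy strategy is an assumption of this sketch, not something the DN does. A real move would have to happen with the DN stopped and would also need to carry along whatever metadata files the DataNode keeps next to each block.

```python
def plan_moves(volumes):
    """Greedy plan for evening out used space across volumes on one node.

    volumes: dict mapping mount point -> list of (block_name, size) tuples.
    Returns a list of (block_name, source_volume, destination_volume).
    Nothing is moved on disk; this only produces a plan.
    """
    used = {vol: sum(size for _, size in blocks) for vol, blocks in volumes.items()}
    movable = {vol: list(blocks) for vol, blocks in volumes.items()}
    moves = []
    if len(used) < 2:
        return moves
    while True:
        src = max(used, key=used.get)  # fullest volume
        dst = min(used, key=used.get)  # emptiest volume
        gap = used[src] - used[dst]
        # Only blocks smaller than the gap actually narrow it.
        candidates = [b for b in movable[src] if 0 < b[1] < gap]
        if not candidates:
            break
        block = max(candidates, key=lambda b: b[1])
        movable[src].remove(block)
        used[src] -= block[1]
        used[dst] += block[1]
        moves.append((block[0], src, dst))
    return moves


if __name__ == "__main__":
    # Hypothetical node: one volume with four 64 MB blocks, one freshly replaced disk.
    example = {
        "/data/0": [("blk_1", 64), ("blk_2", 64), ("blk_3", 64), ("blk_4", 64)],
        "/data/1": [],
    }
    for name, src, dst in plan_moves(example):
        print(f"move {name}: {src} -> {dst}")
```

Blocks that have already been moved are not considered again, so the plan never shuffles the same block twice.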
> We are building a new cluster aimed primarily at storage - we will be using
> SuperMicro 4U machines with 36 2TB SATA disks in three RAID6 volumes (for
> roughly 20TB usable per volume, 60 total), plus two SSDs for OS. At this
> scale, JBOD is going to kill you (rebalancing 40-50TB, even when I bond 8
> gigE interfaces together, takes too long).
Interesting. It's not JBOD that kills you there, it's the fact that you have
72TB of data on a single box. You mitigate the failure risks by going RAID6,
but you're still going to have block movement issues with high-storage
DataNodes if/when they fail for any reason.
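As a rough back-of-the-envelope check on the numbers in this exchange, the sketch below works out RAID6 usable capacity and the best-case time to push 40-50TB over 8 bonded gigE links. It assumes decimal units (1TB = 10^12 bytes) and full line rate; the efficiency parameter and both function names are hypothetical, and real re-replication throughput will be lower.

```python
def raid6_usable_tb(disks_per_volume, disk_tb):
    """RAID6 gives up two disks' worth of capacity per volume to parity."""
    return (disks_per_volume - 2) * disk_tb


def transfer_hours(terabytes, gigabit_links, efficiency=1.0):
    """Best-case hours to move `terabytes` over `gigabit_links` bonded 1 Gbit/s links.

    efficiency is a hypothetical fudge factor (1.0 means full line rate).
    """
    total_bytes = terabytes * 10**12
    bytes_per_second = gigabit_links * 10**9 / 8 * efficiency
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    per_volume = raid6_usable_tb(12, 2)  # 36 disks split into three 12-disk volumes
    print("usable per volume (TB):", per_volume)
    print("usable per node (TB):", 3 * per_volume)
    print("raw per node (TB):", 36 * 2)
    for tb in (40, 50):
        print(f"{tb} TB over 8 x gigE: {transfer_hours(tb, 8):.1f} hours at line rate")
```

Even at line rate that is on the order of half a day per failed node, which is the block movement issue in question regardless of how the disks are arranged.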
> On Feb 7, 2011, at 12:25 PM, John Buchanan wrote:
> My company will be building a small but quickly growing Hadoop deployment,
> and I had a question regarding best practice for configuring the storage for
> the datanodes. Cloudera has a page where they recommend a JBOD
> configuration over RAID. My question, though, is whether they are referring
> to the simplest definition of JBOD, that being literally just a collection
> of heterogeneous drives, each with its own distinct partition and mount
> point? Or are they referring to a concatenated span of heterogeneous drives
> presented to the OS as a single device?
> Through some digging I've discovered that data volumes may be specified in
> a comma-delimited fashion in the hdfs-site.xml file and are then accessed
> individually, but are of course all available within the pool. To test this
> I brought an Ubuntu Server 10.04 VM online (on Xen Cloud Platform) with 3
> storage volumes. The first is the OS; I created a single partition on each
> of the second and third, mounting them as /hadoop-datastore/a and
> /hadoop-datastore/b respectively, and specifying them in hdfs-site.xml in
> comma-delimited fashion. I then continued to construct a single node
> pseudo-distributed install, executed the bin/start-all.sh script, and all
> seems just great. The volumes are 5GB each, and the HDFS status page shows
> a configured capacity of 9.84GB, so both are in use. I successfully added a
> file using bin/hadoop dfs -put.
> This led me to think that perhaps an optimal datanode configuration would
> be 2 drives in RAID1 for OS, then 2-4 additional drives for data,
> individually partitioned, mounted, and configured in hdfs-site.xml.
> Mirrored system drives would make my node more robust but data drives would
> still be independent. I do realize that HDFS assures data redundancy at a
> higher level by design, but if the loss of a single drive necessitated
> rebuilding an entire node, and therefore being down in capacity during that
> period, that just doesn't seem to be the most efficient approach.
> Would love to hear what others are doing in this regard, whether anyone is
> using concatenated disks and whether the loss of a drive requires them to
> rebuild the entire system.
> John Buchanan
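For reference, the comma-delimited volume list discussed in this thread can be sketched as below. The helper builds the hdfs-site.xml property from a list of mount points. The property name dfs.data.dir is an assumption based on Hadoop releases of this era (the thread itself never names it, and later releases renamed it), and the function name is hypothetical.

```python
def data_dir_property(mount_points, name="dfs.data.dir"):
    """Render an hdfs-site.xml <property> listing one data directory per disk.

    name defaults to dfs.data.dir, which is assumed here; check the
    property name against the Hadoop version actually in use.
    """
    value = ",".join(mount_points)
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # The two test volumes from the question above.
    print(data_dir_property(["/hadoop-datastore/a", "/hadoop-datastore/b"]))
    # Ten JBOD disks mounted on /data/0 through /data/9.
    print(data_dir_property([f"/data/{i}" for i in range(10)]))
```

Each path is a separately mounted disk, so losing one disk only takes that entry out of the list.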