-
HDFS drive, partition best practice
John Buchanan 2011-02-07, 20:25
Hello,
My company will be building a small but quickly growing Hadoop deployment, and I had a question regarding best practice for configuring the storage for the datanodes. Cloudera has a page where they recommend a JBOD configuration over RAID. My question, though, is whether they are referring to the simplest definition of JBOD, that being literally just a collection of heterogeneous drives, each with its own distinct partition and mount point? Or are they referring to a concatenated span of heterogeneous drives presented to the OS as a single device?
Through some digging I've discovered that data volumes may be specified in a comma-delimited fashion in the hdfs-site.xml file and are then accessed individually, but are of course all available within the pool. To test this I brought an Ubuntu Server 10.04 VM online (on Xen Cloud Platform) with 3 storage volumes. The first is the OS; I created a single partition on each of the second and third, mounted them as /hadoop-datastore/a and /hadoop-datastore/b respectively, and specified them in hdfs-site.xml in comma-delimited fashion. I then went on to construct a single-node pseudo-distributed install and executed the bin/start-all.sh script, and all seems just great. The volumes are 5GB each, and the HDFS status page shows a configured capacity of 9.84GB, so both are in use. I successfully added a file using bin/hadoop dfs -put.
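For reference, the comma-delimited setup described above can be sketched as a small script that renders the relevant hdfs-site.xml entry. This is a minimal sketch and not taken from the thread: the property name dfs.data.dir is an assumption (it is the name used by Hadoop releases of that era; later releases call it dfs.datanode.data.dir), and render_hdfs_site is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def render_hdfs_site(properties):
    """Return hdfs-site.xml text for a dict of property name -> value.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# One directory per physical disk, joined with commas and no spaces.
data_dirs = ["/hadoop-datastore/a", "/hadoop-datastore/b"]

# "dfs.data.dir" is an assumed property name; check the docs for your version.
hdfs_site = render_hdfs_site({"dfs.data.dir": ",".join(data_dirs)})
print(hdfs_site)
```

Each listed directory sits on its own partition and mount point, so the DataNode treats them as independent volumes rather than as one concatenated device.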
This led me to think that perhaps an optimal datanode configuration would be 2 drives in RAID1 for OS, then 2-4 additional drives for data, individually partitioned, mounted, and configured in hdfs-site.xml. Mirrored system drives would make my node more robust, but the data drives would still be independent. I do realize that HDFS assures data redundancy at a higher level by design, but having the loss of a single drive necessitate rebuilding an entire node, and therefore being down in capacity during that period, just doesn't seem to be the most efficient approach.
Would love to hear what others are doing in this regard, whether anyone is using concatenated disks and whether the loss of a drive requires them to rebuild the entire system.
John Buchanan
-
Re: HDFS drive, partition best practice
Jonathan Disher 2011-02-07, 22:06
I've never seen an implementation of concat volumes that tolerates a disk failure; it's just like RAID0.
Currently I have a 48 node cluster using Dell R710s with 12 disks each - two 250GB SATA drives in RAID1 for OS, and ten 1TB SATA disks as a JBOD (mounted on /data/0 through /data/9) and listed separately in hdfs-site.xml. It works... mostly. The big issue you will encounter is losing a disk - the DataNode process will crash, and if you comment out the affected drive, when you replace it you will have 9 disks full to N% and one empty disk. The DFS balancer cannot fix this - usually when I have data nodes down more than an hour, I format all drives in the box and rebalance.
We are building a new cluster aimed primarily at storage - we will be using SuperMicro 4U machines with 36 2TB SATA disks in three RAID6 volumes (for roughly 20TB usable per volume, 60 total), plus two SSD's for OS. At this scale, JBOD is going to kill you (rebalancing 40-50TB, even when I bond 8 gigE interfaces together, takes too long).
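To put the "takes too long" remark in numbers, here is a rough back-of-the-envelope sketch of how long it takes to move 40-50TB over 8 bonded gigabit links. It assumes decimal terabytes and perfect line-rate throughput; the efficiency parameter is a hypothetical knob for modelling real-world overhead, so the figures are a lower bound, not a measurement.

```python
def transfer_hours(data_tb, links, link_gbps=1.0, efficiency=1.0):
    """Hours needed to move data_tb terabytes over `links` bonded links.

    efficiency is a hypothetical factor (0..1] for protocol and disk overhead.
    """
    bytes_total = data_tb * 1e12
    bytes_per_sec = links * link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_sec / 3600


for tb in (40, 50):
    hours = transfer_hours(tb, links=8)
    print(f"{tb} TB over 8 x 1 GbE: {hours:.1f} h at line rate")
```

Even in this best case the copy takes on the order of half a day, and any throttling or disk contention stretches it further.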
-j
-
RE: HDFS drive, partition best practice
Scott Golby 2011-02-07, 22:40
> The big issues you will encounter is losing a disk - the DataNode process will crash, and if you comment out the affected drive,
> when you replace it you will have 9 disks full to N% and one empty disk.
> The DFS balancer cannot fix this - usually when I have data nodes down more than an hour, I format all drives in the box and rebalance.
Yeah, this bites us when we add a disk. Love getting monitors going off for "disk 90% full" when you've got the new disk at <10%. We've tried a few tricks, moving the reserved blocks up to force it to 'balance', but it's pretty ineffective by and large.
>> but if the loss of a single drive necessitated rebuilding an entire node, and therefore being down in capacity during that period,
>> just doesn't seem to be the most efficient approach
This bit about rebuilding the entire node isn't true; that's just Jonathan's choice to wipe the node, & an interesting one it is (we might consider that for our small cluster). Lose a disk & you lose just the capacity of that disk from the entire pool of space in the cluster.
1 out of 3 copies of *some* of the HDFS blocks goes away, not the entire node's blocks. Usually this wouldn't be very much of a loss (typical 4 disk boxes, x XYZ boxes = quite a few disks). The 1 missing replica will likely be re-copied (I often say re-built, but that's RAID) before you put the new disk in, but say somehow you were 100% full: you'd add the new disk and the blocks which were in a 2 copies/replica state would copy themselves a 3rd time. (The lack of intra-node disk balance is an issue again here.)
> We are building a new cluster aimed primarily at storage - we will be using SuperMicro 4U machines
> with 36 2TB SATA disks in three RAID6 volumes (for roughly 20TB usable per volume, 60 total),
I really like the SuperMicro cases for big disk boxes. What are you using to run the 36 disks all at once?
Scott Golby
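The point that a single failed disk only costs one replica of *some* blocks can be illustrated with a rough estimate. The sketch below assumes replicas are spread evenly over every data disk in the cluster and that a block's replicas always sit on different disks; both are simplifying assumptions, and single_disk_loss is a hypothetical helper. The example numbers use the 48-node, ten-data-disk cluster described earlier in the thread.

```python
def single_disk_loss(nodes, data_disks_per_node, replication=3):
    """Estimate the impact of losing one data disk.

    Returns (share of all replicas lost, share of blocks that lose one replica),
    assuming replicas are spread uniformly over all data disks.
    """
    total_disks = nodes * data_disks_per_node
    replica_share = 1 / total_disks
    block_share = min(1.0, replication / total_disks)
    return replica_share, block_share


replica_share, block_share = single_disk_loss(nodes=48, data_disks_per_node=10)
print(f"replicas lost: {replica_share:.3%}")
print(f"blocks down to 2 replicas: {block_share:.3%}")
```

Under these assumptions well under 1% of blocks drop to two replicas, and the NameNode only needs to re-copy roughly one disk's worth of data.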
-
Re: HDFS drive, partition best practice
John Buchanan 2011-02-08, 15:20
Thank you Scott and Jonathan, this is exactly the sort of information I've been looking for. Coming from environments where data integrity is not ensured in the same distributed manner as with HDFS, concatenation or striping without parity would make me lose sleep at night.
What we were thinking for our first deployment was 10 HP DL385's, each with 8 2TB SATA drives: the first pair in RAID1 for the system drive, the remaining drives each containing a distinct partition and mount point, then specified in hdfs-site.xml in comma-delimited fashion. It seems to make more sense to use RAID at least for the system drives so the loss of 1 drive won't take down the entire node. Granted, data integrity wouldn't be affected, but how much time do you want to spend rebuilding an entire node due to the loss of one drive? We considered using a smaller pair for the system drives, but if they're all the same then we only need to stock one type of spare drive.
Another question I have is whether using 1TB drives would be advisable over 2TB for the purpose of reducing rebuild time. Or perhaps I'm still thinking of this as I would a RAID volume. If we needed to rebalance across the cluster, would the time needed be more dependent on the amount of data involved and the connectivity between nodes?
-John
-
Re: HDFS drive, partition best practice
Allen Wittenauer 2011-02-08, 17:25
On Feb 8, 2011, at 7:20 AM, John Buchanan wrote:
> What we were thinking for our first deployment was 10 HP DL385's each with 8 2TB SATA drives. First pair in Raid1 for the system drive, the remaining each containing a distinct partition and mount point, then specified in hdfs-site.xml in comma-delimited fashion. Seems to make more sense to use Raid at least for the system drives so the loss of 1 drive won't take down the entire node. Granted data integrity wouldn't be affected but how much time do you want to spend rebuilding an entire node due to the loss of one drive. Considered using a smaller pair for the system drives but if they're all the same then we only need to stock one type of spare drive.
Don't bother RAID'ing the system drive. Seriously. You're giving up performance for something that rarely happens. If you have decent configuration management, rebuilding a node is not a big deal and doesn't take that long anyway.
Besides, losing one of the JBOD disks will likely bring the node down anyway.
> Another question I have is whether using 1TB drives would be advisable over 2TB for the purpose of reducing rebuild time.
You're overthinking the rebuild time. Again, configuration management makes this a non-issue.
> Or perhaps I'm still thinking of this as I would a Raid volume. If we needed to rebalance across the cluster would the time needed be more dependent on the amount of data involved and the connectivity between nodes?
Yes.
When a node goes down, the data and tasks are automatically moved. So a node can be down for as long as it needs to be down. The grid will still be functional. So don't panic if a compute node goes down. :)
-
Re: HDFS drive, partition best practice
Bharath Mundlapudi 2011-02-08, 19:10
Essentially, you are keeping 2TB (+2TB RAID1) for the first pair. Which files are you planning to put in this pair?
Just curious, how big can the userlogs dir grow on your systems?
-Bharath
-
Re: HDFS drive, partition best practice
Adam Phelps 2011-02-08, 19:33
On 2/7/11 2:06 PM, Jonathan Disher wrote:
> Currently I have a 48 node cluster using Dell R710's with 12 disks - two 250GB SATA drives in RAID1 for OS, and ten 1TB SATA disks as a JBOD (mounted on /data/0 through /data/9) and listed separately in hdfs-site.xml. It works... mostly. The big issues you will encounter is losing a disk - the DataNode process will crash, and if you comment out the affected drive, when you replace it you will have 9 disks full to N% and one empty disk.
If DataNode is going down after a single disk failure then you probably haven't set dfs.datanode.failed.volumes.tolerated in hdfs-site.xml. You can up that number to allow DataNode to tolerate dead drives.
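As an illustration of the setting named above, the following sketch renders the hdfs-site.xml entry for dfs.datanode.failed.volumes.tolerated. The helper failed_volumes_property is hypothetical, the value 1 is only an example, and the range check (at least 0 and fewer than the number of configured volumes) is an assumption about how the DataNode validates the value, so confirm it against the documentation for your Hadoop version.

```python
TOLERATED_KEY = "dfs.datanode.failed.volumes.tolerated"


def failed_volumes_property(data_dirs, tolerated):
    """Return the <property> XML for the failed-volumes setting.

    The bounds check is an assumed constraint, not confirmed by the thread.
    """
    if not 0 <= tolerated < len(data_dirs):
        raise ValueError(
            f"tolerated={tolerated} must be >= 0 and < {len(data_dirs)} volumes"
        )
    return (
        "<property>\n"
        f"  <name>{TOLERATED_KEY}</name>\n"
        f"  <value>{tolerated}</value>\n"
        "</property>"
    )


# Ten JBOD data directories, as in the cluster described earlier.
data_dirs = [f"/data/{i}" for i in range(10)]
snippet = failed_volumes_property(data_dirs, tolerated=1)
print(snippet)
```

With a value of 1, the intent is that the DataNode keeps running after one data volume fails instead of shutting down.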
- Adam
-
Re: HDFS drive, partition best practice
Patrick Angeles 2011-02-08, 19:53
On Mon, Feb 7, 2011 at 2:06 PM, Jonathan Disher <[EMAIL PROTECTED]> wrote:
> I've never seen an implementation of concat volumes that tolerate a disk failure, just like RAID0.
>
> Currently I have a 48 node cluster using Dell R710's with 12 disks - two 250GB SATA drives in RAID1 for OS, and ten 1TB SATA disks as a JBOD (mounted on /data/0 through /data/9) and listed separately in hdfs-site.xml. It works... mostly. The big issues you will encounter is losing a disk - the DataNode process will crash, and if you comment out the affected drive, when you replace it you will have 9 disks full to N% and one empty disk. The DFS balancer cannot fix this - usually when I have data nodes down more than an hour, I format all drives in the box and rebalance.
You can manually move blocks around between volumes in a single node (while the DN is not running). It would be great someday if the DN managed this automatically.
> We are building a new cluster aimed primarily at storage - we will be using SuperMicro 4U machines with 36 2TB SATA disks in three RAID6 volumes (for roughly 20TB usable per volume, 60 total), plus two SSD's for OS. At this scale, JBOD is going to kill you (rebalancing 40-50TB, even when I bond 8 gigE interfaces together, takes too long).
Interesting. It's not JBOD that kills you there, it's the fact that you have 72TB of data on a single box. You mitigate the failure risks by going RAID6, but you're still going to have block movement issues by going with high storage Datanodes if/when they fail for any reason.
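For the manual block move mentioned in this message, it helps to know how much data would have to shift to even out the volumes on one node. The sketch below only does that planning arithmetic; it does not touch any block files, it assumes all volumes have the same capacity, and both plan_moves and the 600GB usage figure are hypothetical illustrations of the "9 disks full to N% and one empty disk" situation.

```python
def plan_moves(used_gb):
    """Given GB used per volume, return GB to move off (+) or onto (-) each
    volume so that every volume ends up with the same usage."""
    target = sum(used_gb.values()) / len(used_gb)
    return {volume: used - target for volume, used in used_gb.items()}


# Nine volumes at a hypothetical 600 GB used, plus one freshly replaced disk.
usage = {f"/data/{i}": 600 for i in range(9)}
usage["/data/9"] = 0

plan = plan_moves(usage)
for volume, delta in sorted(plan.items()):
    action = "move off" if delta > 0 else "receive"
    print(f"{volume}: {action} {abs(delta):.0f} GB")
```

In this example each of the nine older volumes gives up 60GB and the replacement disk receives 540GB, which is the amount that would have to be moved by hand while the DataNode is stopped.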
-
Re: HDFS drive, partition best practice
Allen Wittenauer 2011-02-08, 20:09
On Feb 8, 2011, at 11:33 AM, Adam Phelps wrote:
> If DataNode is going down after a single disk failure then you probably haven't set dfs.datanode.failed.volumes.tolerated in hdfs-site.xml. You can up that number to allow DataNode to tolerate dead drives.
a) only if you have a version that supports it
b) that only protects you on the DN side. The TT is, AFAIK, still susceptible to drive failures.
-
Re: HDFS drive, partition best practice
Patrick Angeles 2011-02-08, 20:17
On Tue, Feb 8, 2011 at 12:09 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
> a) only if you have a version that supports it
>
> b) that only protects you on the DN side. The TT is, AFAIK, still susceptible to drive failures.
c) And it only works when the drive fails on read (HDFS-457), not on write (HDFS-1273).
-
Re: HDFS drive, partition best practice
Patrick Angeles 2011-02-08, 20:22
OT:
Allen, did you turn down a job offer from Google or something? GMail sends everything from you straight to the spam folder.
-
Re: HDFS drive, partition best practice
Allen Wittenauer 2011-02-08, 20:43
On Feb 8, 2011, at 12:22 PM, Patrick Angeles wrote:
> OT:
>
> Allen, did you turn down a job offer from Google or something? GMail sends everything from you straight to the spam folder.
I keep ignoring their requests to come interview, so that might have something to do with it. Hmm.
-
Re: HDFS drive, partition best practice
Mag Gam 2011-02-22, 12:34
Interesting conversation. What is your default filesystem? Are you using ext3?