|
Oded Rosen
2011-08-10, 09:22
Allen Wittenauer
2011-08-10, 13:50
Oded Rosen
2011-08-10, 14:25
Evert Lammerts
2011-08-10, 14:56
Scott Carey
2011-08-10, 17:24
Ted Dunning
2011-08-10, 17:40
Scott Carey
2011-08-10, 17:40
Allen Wittenauer
2011-08-10, 19:04
Luke Lu
2011-08-10, 19:19
Brian Bockelman
2011-08-10, 19:31
Ted Dunning
2011-08-10, 19:44
Ted Dunning
2011-08-10, 19:49
Rajiv Chittajallu
2011-08-11, 00:15
Ted Dunning
2011-08-11, 06:13
Steve Loughran
2011-08-13, 19:23
Steve Loughran
2011-08-13, 19:30
|
-
Dedicated disk for operating systemOded Rosen 2011-08-10, 09:22
Hi,
What is the best practice regarding disk allocation on hadoop data nodes? We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos). Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks? Thanks, Oded Rosen
-
Re: Dedicated disk for operating systemAllen Wittenauer 2011-08-10, 13:50
On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote: > Hi, > What is the best practice regarding disk allocation on hadoop data nodes? > We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos). > Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks? It's a waste to put the OS disk on a separate disk. Every spindle = performance, esp for MR spills. I'm currently configuring: disk 1 - os, swap, app area, MR spill space, HDFS space disk 2 through n - swap, MR spill space, HDFS space The usual reason people say to put the OS on a separate space is to make upgrades easier as you won't have to touch the application. The reality is that you're going to blow away the entire machine during an upgrade anyway. So don't worry about this situation. I know a lot of people combine the MR spill space and HDFS space onto the same partition, but I've found that keeping them separate has two advantages: * No longer have to deal with the stupid math that HDFS uses for reservation--no question as to how much space one actually has * A hard limit on MR space kills badly written jobs before they eat up enough space to nuke HDFS Of course, the big disadvantage is one needs to calculate the correct space needed, and that's a toughie. But if you know your applications then not a problem. Besides, if one gets it wrong, you can always do a rolling re-install to fix it. Also note that in this configuration that one cannot take advantage of the "keep the machine up at all costs" features in newer Hadoop's, which require that root, swap, and the log area be mirrored to be truly effective. I'm not quite convinced that those features are worth it yet for anything smaller than maybe a 12 disk config.
-
RE: Dedicated disk for operating systemOded Rosen 2011-08-10, 14:25
Thanks,
This is helpful -----Original Message----- From: Allen Wittenauer [mailto:[EMAIL PROTECTED]] Sent: Wednesday, August 10, 2011 4:50 PM To: [EMAIL PROTECTED] Subject: Re: Dedicated disk for operating system On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote: > Hi, > What is the best practice regarding disk allocation on hadoop data nodes? > We plan on having multiple storage disks per node, and we want to know if we should save a smaller, separate disk for the os (centos). > Is it the suggested configuration, or is it ok to let the OS reside on one of the HDFS storage disks? It's a waste to put the OS disk on a separate disk. Every spindle = performance, esp for MR spills. I'm currently configuring: disk 1 - os, swap, app area, MR spill space, HDFS space disk 2 through n - swap, MR spill space, HDFS space The usual reason people say to put the OS on a separate space is to make upgrades easier as you won't have to touch the application. The reality is that you're going to blow away the entire machine during an upgrade anyway. So don't worry about this situation. I know a lot of people combine the MR spill space and HDFS space onto the same partition, but I've found that keeping them separate has two advantages: * No longer have to deal with the stupid math that HDFS uses for reservation--no question as to how much space one actually has * A hard limit on MR space kills badly written jobs before they eat up enough space to nuke HDFS Of course, the big disadvantage is one needs to calculate the correct space needed, and that's a toughie. But if you know your applications then not a problem. Besides, if one gets it wrong, you can always do a rolling re-install to fix it. Also note that in this configuration that one cannot take advantage of the "keep the machine up at all costs" features in newer Hadoop's, which require that root, swap, and the log area be mirrored to be truly effective. I'm not quite convinced that those features are worth it yet for anything smaller than maybe a 12 disk config.
-
RE: Dedicated disk for operating systemEvert Lammerts 2011-08-10, 14:56
A short, slightly off-topic question:
> Also note that in this configuration that one cannot take > advantage of the "keep the machine up at all costs" features in newer > Hadoop's, which require that root, swap, and the log area be mirrored > to be truly effective. I'm not quite convinced that those features are > worth it yet for anything smaller than maybe a 12 disk config. Dell and Cloudera promote the C2100. I'd like to see the calculations behind that config. Am I wrong thinking that keeping your cluster up with such dense nodes will only work if you have many (order of magnitude 100+) of them, and interconnected with 10Gb Ethernet? If you don't then recovery times from failing disks / rack switches are going to get crazy, right? If you want to get bang for buck, don't the proportions "disk IO / processing power", "node storage capacity / ethernet speed" and "total amount of nodes / ethernet speed", indicate many small nodes with not too many disks and 1Gb Ethernet? Cheers, Evert
-
Re: Dedicated disk for operating systemScott Carey 2011-08-10, 17:24
On 8/10/11 7:56 AM, "Evert Lammerts" <[EMAIL PROTECTED]> wrote: >A short, slightly off-topic question: > >> Also note that in this configuration that one cannot take >> advantage of the "keep the machine up at all costs" features in newer >> Hadoop's, which require that root, swap, and the log area be mirrored >> to be truly effective. I'm not quite convinced that those features are >> worth it yet for anything smaller than maybe a 12 disk config. > >Dell and Cloudera promote the C2100. I'd like to see the calculations >behind that config. Am I wrong thinking that keeping your cluster up with >such dense nodes will only work if you have many (order of magnitude >100+) of them, and interconnected with 10Gb Ethernet? >If you don't then recovery times from failing disks / rack switches are >going to get crazy, right? If you want to get bang for buck, don't the >proportions "disk IO / processing power", "node storage capacity / >ethernet speed" and "total amount of nodes / ethernet speed", indicate >many small nodes with not too many disks and 1Gb Ethernet? IMO and experience, absolutely. Get to 40 nodes before you even think about going for high density machines with more than 5 drives and the higher end network infrastructure needed. 1GB ethernet (or 2x1GB bonded) with smaller machines (1 quad+ core, 4 drives) and 40 nodes will beat the same total cost cluster with larger (8+ drives, 2x quad core cpu, double the RAM) nodes (~18 or so, with more cores but lower Mhz) every time. And your failure scenario (replicate a 40th of data versus an 18th if a node fails) is better. In short, try to get to 50 to 100 nodes before you look at larger machines. For larger clusters (200 - thousands) the tradeoffs are very different -- expensive network infrastructure is required and the incremental cost of going to 10Gb network isn't as large. The cost sweet spot for a server goes from ~$4k to ~$10k depending on cluster size and whether power or space is a larger cost or limiting factor. FWIW, Dell r310's are decent small nodes (r410's are popular for more CPU heavy workloads, but I'm not a fan due to the larger CPU/disk ratio and much higher CPU$ / (CPU * Mhz) ratio). Next gen single socket, 4 drive servers with SandyBridge Xeon E 1200 series processors will use even less power and have about 50% better CPU performance per core than today's common Nehalem processors used in 2 socket servers due to much higher clock speeds and about 15% better performance at the same clock. The next gen Intel processors after that (Ivy Bridge) promise another big Mhz jump without a power increase in late 2012 / early 2013. At that point, I expect that single socket, 4 or 6 core machines will be optimal for a larger range of use cases than now. Per socket CPU power is increasing at a faster rate than drive and network performance. > >Cheers, >Evert
-
Re: Dedicated disk for operating systemTed Dunning 2011-08-10, 17:40
The reliability question can be answered to first order by computing
replication time for a unit of storage and then computing how often that replication time will contain additional failures sufficient to cause data loss. Such data loss events should be roughly Poisson distributed with rate equal to the rate of the original failures times the probability that any failure actually is a data loss. Second order effects appear when one replication spills into the next increasing the replication period for the second event. It is difficult to impossible to account for all of the second order effects in closed form and I have found it necessary to resort to discrete event simulation to estimate failure mode probabilities in detail. For small numbers of disks per node, one second order effect that becomes important is the node failure rate. Grouping disks into storage groups or failing an entire node when one disk fails are ways that the storage units are larger than individual disks. Use of a volume manager or RAID-0 will increase the storage unit size. These failure modes drive some limitations on cluster size since the absolute rate of storage unit failures increases with cluster size. For a fixed number of drives in each storage unit, the limiting factor is the total number of disk drives, not the number of nodes. For older versions of Hadoop, the storage unit was all drives on the system which is quite dangerous in terms of mean time to data loss. More recently, a fix has been committed to trunk (and I think .204, Todd will correct me if I am wrong) that makes the storage unit equal to a single drive. In the previous situation, it was dangerous to have too many drives on each node in large clusters. With single disk storage units, the number of drives per machine does not matter in this computation. To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive MTBF of 1000 days, we should be seeing drive failures on average once per day. With 1G ethernet and 30MB/s/node dedicated to re-replication, it will just over 10 minutes to restore replication of a single drive and will take just over 100 minutes to restore replication of an entire machine. The probability of 2 disk failures during the 15 minutes after a failure is roughly \lambda^2 e^-\lambda / 2 where \lambda = 15 minutes / 24 hours. This is a small probability so average times between data loss should be relatively long. For the larger storage unit of 10 disks, the probability is not so small and data loss should be expected every few years or so. For a 10,000 node cluster, however, we should expect the average rate of disk failure rate of one failure every 2.5 hours. Here, the number of disks is large enough that the first order computation is much less accurate since the placement of disk blocks across the cluster will often have more non-uniformity due to small counts. This non-uniformity increases the replication recovery time. With the large storage unit model, the probability that three disk failures will stack up becomes unacceptably large. Even with the single disk storage unit, the data loss rate becomes large enough that the cluster cannot be considered archival. The real question about optimal configuration depends on how fast the cluster can move data from disk. If this rate is relatively low compared to the hardware speeds, then supporting full performance from large numbers of drives is very difficult. If you can maintain high transfer rates, however, you can substantially decrease the cost of your cluster by having fewer nodes. On Wed, Aug 10, 2011 at 7:56 AM, Evert Lammerts <[EMAIL PROTECTED]>wrote: > A short, slightly off-topic question: > > > Also note that in this configuration that one cannot take > > advantage of the "keep the machine up at all costs" features in newer > > Hadoop's, which require that root, swap, and the log area be mirrored > > to be truly effective. I'm not quite convinced that those features are
-
Re: Dedicated disk for operating systemScott Carey 2011-08-10, 17:40
On 8/10/11 6:50 AM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote: > >On Aug 10, 2011, at 2:22 AM, Oded Rosen wrote: > >> Hi, >> What is the best practice regarding disk allocation on hadoop data >>nodes? >> We plan on having multiple storage disks per node, and we want to know >>if we should save a smaller, separate disk for the os (centos). >> Is it the suggested configuration, or is it ok to let the OS reside on >>one of the HDFS storage disks? > > > It's a waste to put the OS disk on a separate disk. Every spindle >performance, esp for MR spills. > > I'm currently configuring: > >disk 1 - os, swap, app area, MR spill space, HDFS space >disk 2 through n - swap, MR spill space, HDFS space We do something similar, except that disk 1 does not have MR spill space. Disk 1 is OS, logs, swap, app area, HDFS. Disk2 is MR spill/temp, HDFS. Also, we put the HDFS partitions in the 'front' of the disk where sequential transfers are faster, and the other stuff at the end. > > The usual reason people say to put the OS on a separate space is to make >upgrades easier as you won't have to touch the application. The reality >is that you're going to blow away the entire machine during an upgrade >anyway. So don't worry about this situation. > > I know a lot of people combine the MR spill space and HDFS space onto >the same partition, but I've found that keeping them separate has two >advantages: > > * No longer have to deal with the stupid math that HDFS uses for >reservation--no question as to how much space one actually has > * A hard limit on MR space kills badly written jobs before they eat up >enough space to nuke HDFS Furthermore, the disk performance is MUCH better if yo split them and optimize the file system and mount parameters for the different workloads. M/R spill in the same place as HDFS was causing a lot of random seeks for us and throttling HDFS performance. * HDFS needs mostly sequential write and read optimized file system and mount parameters (and some random read). It is also not metadata heavy. We found that XFS worked very well for this, and has an online defragmenter we use to keep that partition in good shape. We are not disk I/O bound in HDFS with 4 drives/server this way. Ext4 is an option too, but has no online defragmenter. Ext3 gets really fragmented after a while causing the system to get I/O bound more regularly as the node aged. Ext4 should be much better at avoiding fragmentation than ext3. * M/R spill is metadata intensive, with many small reads and writes in addition to larger writes and reads and files that come and go. We found that using ext4 for this, with optimized mount parameters (rw,noatime,nobarrier,data=writeback,commit=30) tremendously reduced I/O for M/R temp as many files didn't even live 30 seconds to get flushed to disk. These settings are not appropriate for the HDFS partition. XFS is a horrible option for M/R spill and temp -- it performs very poorly with those workloads. > > Of course, the big disadvantage is one needs to calculate the correct >space needed, and that's a toughie. But if you know your applications >then not a problem. Besides, if one gets it wrong, you can always do a >rolling re-install to fix it. > > Also note that in this configuration that one cannot take advantage of >the "keep the machine up at all costs" features in newer Hadoop's, which >require that root, swap, and the log area be mirrored to be truly >effective. I'm not quite convinced that those features are worth it yet >for anything smaller than maybe a 12 disk config.
-
Re: Dedicated disk for operating systemAllen Wittenauer 2011-08-10, 19:04
On Aug 10, 2011, at 7:56 AM, Evert Lammerts wrote: > A short, slightly off-topic question: > >> Also note that in this configuration that one cannot take >> advantage of the "keep the machine up at all costs" features in newer >> Hadoop's, which require that root, swap, and the log area be mirrored >> to be truly effective. I'm not quite convinced that those features are >> worth it yet for anything smaller than maybe a 12 disk config. > > Dell and Cloudera promote the C2100. I'd like to see the calculations behind that config. If Dell is shipping the same box they shipped us to test a few months ago, the performance was pretty horrid vs. almost all their competitors. The main problem was the controller--it was built for RAID, not for JBOD. (... and then there is the OOB support...) > Am I wrong thinking that keeping your cluster up with such dense nodes will only work if you have many (order of magnitude 100+) of them, and interconnected with 10Gb Ethernet? If you don't then recovery times from failing disks / rack switches are going to get crazy, right? If one assumes that a bunch of nodes are failing at once, yes. The irony is that ops teams tend to group repairs, so keeping them up might actually be the wrong thing in relation to actual practice. > If you want to get bang for buck, don't the proportions "disk IO / processing power", "node storage capacity / ethernet speed" and "total amount of nodes / ethernet speed", indicate many small nodes with not too many disks and 1Gb Ethernet? The biggest constraint is almost always RAM, as you can use it to help with the rest.
-
Re: Dedicated disk for operating systemLuke Lu 2011-08-10, 19:19
On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive > MTBF of 1000 days, we should be seeing drive failures on average once per > day.... > For a 10,000 node cluster, however, we should expect the average rate of > disk failure rate of one failure every 2.5 hours. Do you have real data to back the analysis? You assume a uniform disk failure distribution, which is absolutely not true. I can only say that our ops data across 40000+ nodes shows that the above analysis is not even close. (This is assuming that the ops know what they are doing though :) __Luke
-
Re: Dedicated disk for operating systemBrian Bockelman 2011-08-10, 19:31
MTTF is a difficult number. Popular papers include: http://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html, http://labs.google.com/papers/disk_failures.pdf
Ted is assuming a MTTF of 25kHours; I think that's overly pessimistic, although both papers indicate that MTTF is a crappy way to model disk lifetime. I think a lot has to do with the quality of the batch of hard drives you get and operating conditions. Brain On Aug 10, 2011, at 2:19 PM, Luke Lu wrote: > On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: >> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive >> MTBF of 1000 days, we should be seeing drive failures on average once per >> day.... >> For a 10,000 node cluster, however, we should expect the average rate of >> disk failure rate of one failure every 2.5 hours. > > Do you have real data to back the analysis? You assume a uniform disk > failure distribution, which is absolutely not true. I can only say > that our ops data across 40000+ nodes shows that the above analysis is > not even close. (This is assuming that the ops know what they are > doing though :) > > __Luke
-
Re: Dedicated disk for operating systemTed Dunning 2011-08-10, 19:44
Agreed on both points.
Of course, if you are doing a failure analysis it is something of a professional obligation to be somewhat pessimistic. For one thing, being pessimistic on the factors you know about may compensate to some degree for your inadvertent optimism regarding the factors you don't know about or think are minor. In general, assuming stable distributions with finite variance is a bit dangerous in any case. On the other hand, doing any real modeling with Levy distributions is hard enough that you probably gain less insight for being more correct. It is worth incorporating correlated failures into your model if you can but even that is non-trivial. On Wed, Aug 10, 2011 at 12:31 PM, Brian Bockelman <[EMAIL PROTECTED]>wrote: > Ted is assuming a MTTF of 25kHours; I think that's overly pessimistic, > although both papers indicate that MTTF is a crappy way to model disk > lifetime. >
-
Re: Dedicated disk for operating systemTed Dunning 2011-08-10, 19:49
Luke,
Yes, I do have some data to back this up, but I think I mentioned that this was just the back of an envelope type computation. As such, it necessarily ignores a number of factors. Can you say what specifically it is that you object to? Is the analysis pessimistic or optimistic? Are you seeing lots of correlated failures? I presume that your 40,000+ nodes are not in a single cluster and thus have different failure modes than I was talking about. Perhaps you could say more about your situation. In many installations, duty factor is low enough that average failure rate can be an order of magnitude lower than what I quoted. Even so, I don't feel comfortable using that kind of rate for a computation of this sort. On Wed, Aug 10, 2011 at 12:19 PM, Luke Lu <[EMAIL PROTECTED]> wrote: > On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > To be specific, taking a 100 node x 10 disk x 2 TB configuration with > drive > > MTBF of 1000 days, we should be seeing drive failures on average once per > > day.... > > For a 10,000 node cluster, however, we should expect the average rate of > > disk failure rate of one failure every 2.5 hours. > > Do you have real data to back the analysis? You assume a uniform disk > failure distribution, which is absolutely not true. I can only say > that our ops data across 40000+ nodes shows that the above analysis is > not even close. (This is assuming that the ops know what they are > doing though :) > > __Luke >
-
Re: Dedicated disk for operating systemRajiv Chittajallu 2011-08-11, 00:15
Ted Dunning wrote on 08/10/11 at 10:40:30 -0700:
>To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive >MTBF of 1000 days, we should be seeing drive failures on average once per >day. With 1G ethernet and 30MB/s/node dedicated to re-replication, it will >just over 10 minutes to restore replication of a single drive and will take >just over 100 minutes to restore replication of an entire machine. You are assuming that only one good node is used to restore replication for all the blocks on the failed drive. Which is very unlikely. With replication factor of 3, you will have at least 2 nodes to choose from in the worst case and much more in a standard cluster. And when you are having more spindles, 6+, one would probably consider using the second GigE port, which is standard on most of the commodity gear out there.
-
Re: Dedicated disk for operating systemTed Dunning 2011-08-11, 06:13
Actually I was assuming that the entire cluster participates in the
rebalancing. Repication is not done disk-wise in hadoop but block-wise. On Wednesday, August 10, 2011, Rajiv Chittajallu <[EMAIL PROTECTED]> wrote: > Ted Dunning wrote on 08/10/11 at 10:40:30 -0700: >>To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive >>MTBF of 1000 days, we should be seeing drive failures on average once per >>day. With 1G ethernet and 30MB/s/node dedicated to re-replication, it will >>just over 10 minutes to restore replication of a single drive and will take >>just over 100 minutes to restore replication of an entire machine. > > You are assuming that only one good node is used to restore replication for > all the blocks on the failed drive. Which is very unlikely. With > replication factor of 3, you will have at least 2 nodes to choose from > in the worst case and much more in a standard cluster. > > And when you are having more spindles, 6+, one would probably consider > using the second GigE port, which is standard on most of the commodity > gear out there. > > >
-
Re: Dedicated disk for operating systemSteve Loughran 2011-08-13, 19:23
On 10/08/2011 20:31, Brian Bockelman wrote:
> MTTF is a difficult number. Popular papers include: http://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html, http://labs.google.com/papers/disk_failures.pdf > > Ted is assuming a MTTF of 25kHours; I think that's overly pessimistic, although both papers indicate that MTTF is a crappy way to model disk lifetime. > see also [Gray05] http://research.microsoft.com/apps/pubs/default.aspx?id=64599
-
Re: Dedicated disk for operating systemSteve Loughran 2011-08-13, 19:30
On 11/08/2011 01:15, Rajiv Chittajallu wrote:
> Ted Dunning wrote on 08/10/11 at 10:40:30 -0700: >> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive >> MTBF of 1000 days, we should be seeing drive failures on average once per >> day. With 1G ethernet and 30MB/s/node dedicated to re-replication, it will >> just over 10 minutes to restore replication of a single drive and will take >> just over 100 minutes to restore replication of an entire machine. > > You are assuming that only one good node is used to restore replication for > all the blocks on the failed drive. Which is very unlikely. With > replication factor of 3, you will have at least 2 nodes to choose from > in the worst case and much more in a standard cluster. > > And when you are having more spindles, 6+, one would probably consider > using the second GigE port, which is standard on most of the commodity > gear out there. > > I'd be willing to collaborate on writing some paper on these issues if someone has data they can share; we can look at the observed/predicted failure rates of 12x2TB HDD systems, and discuss actions (moving blocks to any local space would be better than over LAN, for example), compare 2x1Gbe with 1x10Gbe for recovery (and ignoring switch cost) |