|
Steve Ed
2011-10-21, 21:40
Rita
2011-11-07, 10:58
Ted Dunning
2011-11-07, 22:06
Rita
2011-11-08, 00:34
Ted Dunning
2011-11-08, 00:53
Rita
2011-11-08, 12:32
Ted Dunning
2011-11-08, 12:38
Matt Foley
2011-11-11, 09:57
Koji Noguchi
2011-11-11, 17:26
Ted Dunning
2011-11-11, 17:49
Steve Ed
2011-11-11, 17:59
Matt Foley
2011-11-11, 18:15
Todd Lipcon
2011-11-11, 18:37
Matt Foley
2011-11-15, 22:29
|
-
Sizing helpSteve Ed 2011-10-21, 21:40
I am a newbie to Hadoop and trying to understand how to Size a Hadoop
cluster. What are factors I should consider deciding the number of datanodes ? Datanode configuration ? CPU, Memory Amount of memory required for namenode ? My client is looking at 1 PB of usable data and will be running analytics on TB size files using mapreduce. Thanks ... Steve
-
Re: Sizing helpRita 2011-11-07, 10:58
For a 1PB installation you would need close to 170 servers with 12 TB disk
pack installed on them (with replication factor of 2). Thats a conservative estimate CPUs: 4 cores with 16gb of memory Namenode: 4 core with 32gb of memory should be ok. On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: > I am a newbie to Hadoop and trying to understand how to Size a Hadoop > cluster. **** > > ** ** > > What are factors I should consider deciding the number of datanodes ?**** > > Datanode configuration ? CPU, Memory**** > > Amount of memory required for namenode ? **** > > ** ** > > My client is looking at 1 PB of usable data and will be running analytics > on TB size files using mapreduce.**** > > ** ** > > ** ** > > Thanks**** > > ….. Steve**** > > ** ** > -- --- Get your facts first, then you can distort them as you please.--
-
Re: Sizing helpTed Dunning 2011-11-07, 22:06
Depending on which distribution and what your data center power limits are
you may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. With suitable engineering margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy with 2x replication and little space to spare then you can double that density. On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: > For a 1PB installation you would need close to 170 servers with 12 TB disk pack installed on them (with replication factor of 2). Thats a conservative estimate > CPUs: 4 cores with 16gb of memory > > Namenode: 4 core with 32gb of memory should be ok. > > > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: >> >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop cluster. >> >> >> >> What are factors I should consider deciding the number of datanodes ? >> >> Datanode configuration ? CPU, Memory >> >> Amount of memory required for namenode ? >> >> >> >> My client is looking at 1 PB of usable data and will be running analytics on TB size files using mapreduce. >> >> >> >> >> >> Thanks >> >> ….. Steve >> >> > > > -- > --- Get your facts first, then you can distort them as you please.-- >
-
Re: Sizing helpRita 2011-11-08, 00:34
I have been running with 2x replication on a 500tb cluster. No issues
whatsoever. 3x is for super paranoid. On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Depending on which distribution and what your data center power limits are > you may save a lot of money by going with machines that have 12 x 2 or 3 tb > drives. With suitable engineering margins and 3 x replication you can have > 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy > with 2x replication and little space to spare then you can double that > density. > > On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: > > For a 1PB installation you would need close to 170 servers with 12 TB > disk pack installed on them (with replication factor of 2). Thats a > conservative estimate > > CPUs: 4 cores with 16gb of memory > > > > Namenode: 4 core with 32gb of memory should be ok. > > > > > > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: > >> > >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop > cluster. > >> > >> > >> > >> What are factors I should consider deciding the number of datanodes ? > >> > >> Datanode configuration ? CPU, Memory > >> > >> Amount of memory required for namenode ? > >> > >> > >> > >> My client is looking at 1 PB of usable data and will be running > analytics on TB size files using mapreduce. > >> > >> > >> > >> > >> > >> Thanks > >> > >> ….. Steve > >> > >> > > > > > > -- > > --- Get your facts first, then you can distort them as you please.-- > > > -- --- Get your facts first, then you can distort them as you please.--
-
Re: Sizing helpTed Dunning 2011-11-08, 00:53
3x replication has two effects. One is reliability. This is probably more
important in large clusters than small. Another important effect is data locality during map-reduce. Having 3x replication allows mappers to have almost all invocations read from local disk. 2x replication compromises this. Even where you don't have local data, the bandwidth available to read from 3x replicated data is 1.5x the bandwidth available for 2x replication. To get a rough feel for how reliable you should consider a cluster, you can do a pretty simple computation. If you have 12 x 2T on a single machine and you lose that machine, the remaining copies of that data must be replicated before another disk fails. With HDFS and block-level replication, the remaining copies will be spread across the entire cluster to any disk failure is reasonably like to cause data loss. For a 1000 node cluster with 12000 disks, it is conservative to estimate a disk failure on average every 2 hours. Each node will have replicate about 12GB of data which will take about 500 seconds or about 9 or 10 minutes if you only use 25% of your network for re-replication. The probability of a disk failure during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full machine failures might cause data loss. You can pick whatever you like for the rate at which nodes die, but I don't think that this is acceptable. My numbers for disk failures are purposely somewhat pessimistic. If you change the MTBF for disks to 10 years instead of 3 years, then the probability of data loss after a machine failure drops, but only to about 2.5%. Now, I would be the first to say that these numbers feel too high, but I also would rather not experience enough data loss events to have a reliable gut feel for how often they should occur. My feeling is that 2x is fine for data you can reconstruct and which you don't need to read really fast, but not good enough for data whose loss will get you fired. On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: > I have been running with 2x replication on a 500tb cluster. No issues > whatsoever. 3x is for super paranoid. > > > On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> Depending on which distribution and what your data center power limits >> are you may save a lot of money by going with machines that have 12 x 2 or >> 3 tb drives. With suitable engineering margins and 3 x replication you can >> have 5 tb net data per node and 20 nodes per rack. If you want to go all >> cowboy with 2x replication and little space to spare then you can double >> that density. >> >> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: >> > For a 1PB installation you would need close to 170 servers with 12 TB >> disk pack installed on them (with replication factor of 2). Thats a >> conservative estimate >> > CPUs: 4 cores with 16gb of memory >> > >> > Namenode: 4 core with 32gb of memory should be ok. >> > >> > >> > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: >> >> >> >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop >> cluster. >> >> >> >> >> >> >> >> What are factors I should consider deciding the number of datanodes ? >> >> >> >> Datanode configuration ? CPU, Memory >> >> >> >> Amount of memory required for namenode ? >> >> >> >> >> >> >> >> My client is looking at 1 PB of usable data and will be running >> analytics on TB size files using mapreduce. >> >> >> >> >> >> >> >> >> >> >> >> Thanks >> >> >> >> ….. Steve >> >> >> >> >> > >> > >> > -- >> > --- Get your facts first, then you can distort them as you please.-- >> > >> > > > > -- > --- Get your facts first, then you can distort them as you please.-- >
-
Re: Sizing helpRita 2011-11-08, 12:32
Thats a good point. What is hdfs is used as an archive? We dont really use
it for mapreduce more for archival purposes. On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > 3x replication has two effects. One is reliability. This is probably > more important in large clusters than small. > > Another important effect is data locality during map-reduce. Having 3x > replication allows mappers to have almost all invocations read from local > disk. 2x replication compromises this. Even where you don't have local > data, the bandwidth available to read from 3x replicated data is 1.5x the > bandwidth available for 2x replication. > > To get a rough feel for how reliable you should consider a cluster, you > can do a pretty simple computation. If you have 12 x 2T on a single > machine and you lose that machine, the remaining copies of that data must > be replicated before another disk fails. With HDFS and block-level > replication, the remaining copies will be spread across the entire cluster > to any disk failure is reasonably like to cause data loss. For a 1000 node > cluster with 12000 disks, it is conservative to estimate a disk failure on > average every 2 hours. Each node will have replicate about 12GB of data > which will take about 500 seconds or about 9 or 10 minutes if you only use > 25% of your network for re-replication. The probability of a disk failure > during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly > 1 in 12 full machine failures might cause data loss. You can pick > whatever you like for the rate at which nodes die, but I don't think that > this is acceptable. > > My numbers for disk failures are purposely somewhat pessimistic. If you > change the MTBF for disks to 10 years instead of 3 years, then the > probability of data loss after a machine failure drops, but only to about > 2.5%. > > Now, I would be the first to say that these numbers feel too high, but I > also would rather not experience enough data loss events to have a reliable > gut feel for how often they should occur. > > My feeling is that 2x is fine for data you can reconstruct and which you > don't need to read really fast, but not good enough for data whose loss > will get you fired. > > On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: > >> I have been running with 2x replication on a 500tb cluster. No issues >> whatsoever. 3x is for super paranoid. >> >> >> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: >> >>> Depending on which distribution and what your data center power limits >>> are you may save a lot of money by going with machines that have 12 x 2 or >>> 3 tb drives. With suitable engineering margins and 3 x replication you can >>> have 5 tb net data per node and 20 nodes per rack. If you want to go all >>> cowboy with 2x replication and little space to spare then you can double >>> that density. >>> >>> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: >>> > For a 1PB installation you would need close to 170 servers with 12 TB >>> disk pack installed on them (with replication factor of 2). Thats a >>> conservative estimate >>> > CPUs: 4 cores with 16gb of memory >>> > >>> > Namenode: 4 core with 32gb of memory should be ok. >>> > >>> > >>> > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: >>> >> >>> >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop >>> cluster. >>> >> >>> >> >>> >> >>> >> What are factors I should consider deciding the number of datanodes ? >>> >> >>> >> Datanode configuration ? CPU, Memory >>> >> >>> >> Amount of memory required for namenode ? >>> >> >>> >> >>> >> >>> >> My client is looking at 1 PB of usable data and will be running >>> analytics on TB size files using mapreduce. >>> >> >>> >> >>> >> >>> >> >>> >> >>> >> Thanks >>> >> >>> >> ….. Steve >>> >> >>> >> >>> > >>> > >>> > -- >>> > --- Get your facts first, then you can distort them as you please.--
-
Re: Sizing helpTed Dunning 2011-11-08, 12:38
For archival purposes, you don't need speed (mostly). That eliminates one
argument for 3x replication. If you have RAID-5 or RAID-6 on your storage nodes, then you eliminate most of your disk failure costs at the cluster level. This gives you something like 2.2x replication cost. You can also use the cluster level raid. Others will be able to say better whether this is considered standard practice or not. Facebook wrote the code and I believe that they use it. This can decrease replication factor to about 1.2 x as far as space is concerned. I don't know of any failure mode analysis for this usage, however. On Tue, Nov 8, 2011 at 7:32 AM, Rita <[EMAIL PROTECTED]> wrote: > Thats a good point. What is hdfs is used as an archive? We dont really use > it for mapreduce more for archival purposes. > > > On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> 3x replication has two effects. One is reliability. This is probably >> more important in large clusters than small. >> >> Another important effect is data locality during map-reduce. Having 3x >> replication allows mappers to have almost all invocations read from local >> disk. 2x replication compromises this. Even where you don't have local >> data, the bandwidth available to read from 3x replicated data is 1.5x the >> bandwidth available for 2x replication. >> >> To get a rough feel for how reliable you should consider a cluster, you >> can do a pretty simple computation. If you have 12 x 2T on a single >> machine and you lose that machine, the remaining copies of that data must >> be replicated before another disk fails. With HDFS and block-level >> replication, the remaining copies will be spread across the entire cluster >> to any disk failure is reasonably like to cause data loss. For a 1000 node >> cluster with 12000 disks, it is conservative to estimate a disk failure on >> average every 2 hours. Each node will have replicate about 12GB of data >> which will take about 500 seconds or about 9 or 10 minutes if you only use >> 25% of your network for re-replication. The probability of a disk failure >> during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly >> 1 in 12 full machine failures might cause data loss. You can pick >> whatever you like for the rate at which nodes die, but I don't think that >> this is acceptable. >> >> My numbers for disk failures are purposely somewhat pessimistic. If you >> change the MTBF for disks to 10 years instead of 3 years, then the >> probability of data loss after a machine failure drops, but only to about >> 2.5%. >> >> Now, I would be the first to say that these numbers feel too high, but I >> also would rather not experience enough data loss events to have a reliable >> gut feel for how often they should occur. >> >> My feeling is that 2x is fine for data you can reconstruct and which you >> don't need to read really fast, but not good enough for data whose loss >> will get you fired. >> >> On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: >> >>> I have been running with 2x replication on a 500tb cluster. No issues >>> whatsoever. 3x is for super paranoid. >>> >>> >>> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: >>> >>>> Depending on which distribution and what your data center power limits >>>> are you may save a lot of money by going with machines that have 12 x 2 or >>>> 3 tb drives. With suitable engineering margins and 3 x replication you can >>>> have 5 tb net data per node and 20 nodes per rack. If you want to go all >>>> cowboy with 2x replication and little space to spare then you can double >>>> that density. >>>> >>>> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: >>>> > For a 1PB installation you would need close to 170 servers with 12 TB >>>> disk pack installed on them (with replication factor of 2). Thats a >>>> conservative estimate >>>> > CPUs: 4 cores with 16gb of memory >>>> > >>>>
-
Re: Sizing helpMatt Foley 2011-11-11, 09:57
I agree with Ted's argument that 3x replication is way better than 2x. But
I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, if you estimate a disk failure every 2 hours, each node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is still unacceptable, so use 3x replication! :-) --Matt On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > 3x replication has two effects. One is reliability. This is probably > more important in large clusters than small. > > Another important effect is data locality during map-reduce. Having 3x > replication allows mappers to have almost all invocations read from local > disk. 2x replication compromises this. Even where you don't have local > data, the bandwidth available to read from 3x replicated data is 1.5x the > bandwidth available for 2x replication. > > To get a rough feel for how reliable you should consider a cluster, you > can do a pretty simple computation. If you have 12 x 2T on a single > machine and you lose that machine, the remaining copies of that data must > be replicated before another disk fails. With HDFS and block-level > replication, the remaining copies will be spread across the entire cluster > to any disk failure is reasonably like to cause data loss. For a 1000 node > cluster with 12000 disks, it is conservative to estimate a disk failure on > average every 2 hours. Each node will have replicate about 12GB of data > which will take about 500 seconds or about 9 or 10 minutes if you only use > 25% of your network for re-replication. The probability of a disk failure > during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly > 1 in 12 full machine failures might cause data loss. You can pick > whatever you like for the rate at which nodes die, but I don't think that > this is acceptable. > > My numbers for disk failures are purposely somewhat pessimistic. If you > change the MTBF for disks to 10 years instead of 3 years, then the > probability of data loss after a machine failure drops, but only to about > 2.5%. > > Now, I would be the first to say that these numbers feel too high, but I > also would rather not experience enough data loss events to have a reliable > gut feel for how often they should occur. > > My feeling is that 2x is fine for data you can reconstruct and which you > don't need to read really fast, but not good enough for data whose loss > will get you fired. > > On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: > >> I have been running with 2x replication on a 500tb cluster. No issues >> whatsoever. 3x is for super paranoid. >> >> >> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: >> >>> Depending on which distribution and what your data center power limits >>> are you may save a lot of money by going with machines that have 12 x 2 or >>> 3 tb drives. With suitable engineering margins and 3 x replication you can >>> have 5 tb net data per node and 20 nodes per rack. If you want to go all >>> cowboy with 2x replication and little space to spare then you can double >>> that density. >>> >>> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: >>> > For a 1PB installation you would need close to 170 servers with 12 TB >>> disk pack installed on them (with replication factor of 2). Thats a >>> conservative estimate >>> > CPUs: 4 cores with 16gb of memory >>> > >>> > Namenode: 4 core with 32gb of memory should be ok. >>> > >>> > >>> > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: >>> >> >>> >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop >>> cluster. >>> >> >>> >> >>> >> >>> >> What are factors I should consider deciding the number of datanodes ? >>> >> >>> >> Datanode configuration ? CPU, Memory
-
Re: Sizing helpKoji Noguchi 2011-11-11, 17:26
Another factor to consider, when disk is bad you may have corrupted blocks which may only get detected by the periodic DataBlockScanner check.
I believe each datanode tries to finish the entire scan in dfs.datanode.scan.period.hours (3weeks default) period. So with 2x replication and some undetected bad disk(s), you can have blocks with effective replication of 1 which would lead to missing blocks eventually. Koji On 11/11/11 1:57 AM, "Matt Foley" <[EMAIL PROTECTED]> wrote: I agree with Ted's argument that 3x replication is way better than 2x. But I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, if you estimate a disk failure every 2 hours, each node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is still unacceptable, so use 3x replication! :-) --Matt On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: 3x replication has two effects. One is reliability. This is probably more important in large clusters than small. Another important effect is data locality during map-reduce. Having 3x replication allows mappers to have almost all invocations read from local disk. 2x replication compromises this. Even where you don't have local data, the bandwidth available to read from 3x replicated data is 1.5x the bandwidth available for 2x replication. To get a rough feel for how reliable you should consider a cluster, you can do a pretty simple computation. If you have 12 x 2T on a single machine and you lose that machine, the remaining copies of that data must be replicated before another disk fails. With HDFS and block-level replication, the remaining copies will be spread across the entire cluster to any disk failure is reasonably like to cause data loss. For a 1000 node cluster with 12000 disks, it is conservative to estimate a disk failure on average every 2 hours. Each node will have replicate about 12GB of data which will take about 500 seconds or about 9 or 10 minutes if you only use 25% of your network for re-replication. The probability of a disk failure during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full machine failures might cause data loss. You can pick whatever you like for the rate at which nodes die, but I don't think that this is acceptable. My numbers for disk failures are purposely somewhat pessimistic. If you change the MTBF for disks to 10 years instead of 3 years, then the probability of data loss after a machine failure drops, but only to about 2.5%. Now, I would be the first to say that these numbers feel too high, but I also would rather not experience enough data loss events to have a reliable gut feel for how often they should occur. My feeling is that 2x is fine for data you can reconstruct and which you don't need to read really fast, but not good enough for data whose loss will get you fired. On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: I have been running with 2x replication on a 500tb cluster. No issues whatsoever. 3x is for super paranoid. On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: Depending on which distribution and what your data center power limits are you may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. With suitable engineering margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy with 2x replication and little space to spare then you can double that density. On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: > For a 1PB installation you would need close to 170 servers with 12 TB disk pack installed on them (with replication factor of 2). Thats a conservative estimate > CPUs: 4 cores with 16gb of memory > > Namenode: 4 core with 32gb of memory should be ok.
-
Re: Sizing helpTed Dunning 2011-11-11, 17:49
Matt,
Thanks for pointing that out. I was talking about machine chassis failure since it is the more serious case, but should have pointed out that losing single disks is subject to the same logic with smaller amounts of data. If, however, an installation uses RAID-0 for higher read speed then a disk loss is the same as loss of an entire RAID-0 drive set. In contrast, using RAID-5 or 6 makes loss of a single disk less problematic, but hurts performance. On Fri, Nov 11, 2011 at 9:57 AM, Matt Foley <[EMAIL PROTECTED]> wrote: > I agree with Ted's argument that 3x replication is way better than 2x. > But I do have to point out that, since 0.20.204, the loss of a disk no > longer causes the loss of a whole node (thankfully!) unless it's the system > disk. So in the example given, if you estimate a disk failure every 2 > hours, each node only has to re-replicate about 2GB of data, not 12GB. So > about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is > still unacceptable, so use 3x replication! :-) > --Matt > > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > >> 3x replication has two effects. One is reliability. This is probably >> more important in large clusters than small. >> >> Another important effect is data locality during map-reduce. Having 3x >> replication allows mappers to have almost all invocations read from local >> disk. 2x replication compromises this. Even where you don't have local >> data, the bandwidth available to read from 3x replicated data is 1.5x the >> bandwidth available for 2x replication. >> >> To get a rough feel for how reliable you should consider a cluster, you >> can do a pretty simple computation. If you have 12 x 2T on a single >> machine and you lose that machine, the remaining copies of that data must >> be replicated before another disk fails. With HDFS and block-level >> replication, the remaining copies will be spread across the entire cluster >> to any disk failure is reasonably like to cause data loss. For a 1000 node >> cluster with 12000 disks, it is conservative to estimate a disk failure on >> average every 2 hours. Each node will have replicate about 12GB of data >> which will take about 500 seconds or about 9 or 10 minutes if you only use >> 25% of your network for re-replication. The probability of a disk failure >> during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly >> 1 in 12 full machine failures might cause data loss. You can pick >> whatever you like for the rate at which nodes die, but I don't think that >> this is acceptable. >> >> My numbers for disk failures are purposely somewhat pessimistic. If you >> change the MTBF for disks to 10 years instead of 3 years, then the >> probability of data loss after a machine failure drops, but only to about >> 2.5%. >> >> Now, I would be the first to say that these numbers feel too high, but I >> also would rather not experience enough data loss events to have a reliable >> gut feel for how often they should occur. >> >> My feeling is that 2x is fine for data you can reconstruct and which you >> don't need to read really fast, but not good enough for data whose loss >> will get you fired. >> >> On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: >> >>> I have been running with 2x replication on a 500tb cluster. No issues >>> whatsoever. 3x is for super paranoid. >>> >>> >>> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]>wrote: >>> >>>> Depending on which distribution and what your data center power limits >>>> are you may save a lot of money by going with machines that have 12 x 2 or >>>> 3 tb drives. With suitable engineering margins and 3 x replication you can >>>> have 5 tb net data per node and 20 nodes per rack. If you want to go all >>>> cowboy with 2x replication and little space to spare then you can double >>>> that density. >>>> >>>> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote:
-
RE: Sizing helpSteve Ed 2011-11-11, 17:59
I understand that with 0.20.204, loss of a disk doesn't loss the node. But
if we have to replace that lost disk, its again scheduling the whole node down, kicking replication From: Matt Foley [mailto:[EMAIL PROTECTED]] Sent: Friday, November 11, 2011 1:58 AM To: [EMAIL PROTECTED] Subject: Re: Sizing help I agree with Ted's argument that 3x replication is way better than 2x. But I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, if you estimate a disk failure every 2 hours, each node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is still unacceptable, so use 3x replication! :-) --Matt On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: 3x replication has two effects. One is reliability. This is probably more important in large clusters than small. Another important effect is data locality during map-reduce. Having 3x replication allows mappers to have almost all invocations read from local disk. 2x replication compromises this. Even where you don't have local data, the bandwidth available to read from 3x replicated data is 1.5x the bandwidth available for 2x replication. To get a rough feel for how reliable you should consider a cluster, you can do a pretty simple computation. If you have 12 x 2T on a single machine and you lose that machine, the remaining copies of that data must be replicated before another disk fails. With HDFS and block-level replication, the remaining copies will be spread across the entire cluster to any disk failure is reasonably like to cause data loss. For a 1000 node cluster with 12000 disks, it is conservative to estimate a disk failure on average every 2 hours. Each node will have replicate about 12GB of data which will take about 500 seconds or about 9 or 10 minutes if you only use 25% of your network for re-replication. The probability of a disk failure during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full machine failures might cause data loss. You can pick whatever you like for the rate at which nodes die, but I don't think that this is acceptable. My numbers for disk failures are purposely somewhat pessimistic. If you change the MTBF for disks to 10 years instead of 3 years, then the probability of data loss after a machine failure drops, but only to about 2.5%. Now, I would be the first to say that these numbers feel too high, but I also would rather not experience enough data loss events to have a reliable gut feel for how often they should occur. My feeling is that 2x is fine for data you can reconstruct and which you don't need to read really fast, but not good enough for data whose loss will get you fired. On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: I have been running with 2x replication on a 500tb cluster. No issues whatsoever. 3x is for super paranoid. On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: Depending on which distribution and what your data center power limits are you may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. With suitable engineering margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy with 2x replication and little space to spare then you can double that density. On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: > For a 1PB installation you would need close to 170 servers with 12 TB disk pack installed on them (with replication factor of 2). Thats a conservative estimate > CPUs: 4 cores with 16gb of memory > > Namenode: 4 core with 32gb of memory should be ok. > > > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: >> >> I am a newbie to Hadoop and trying to understand how to Size a Hadoop cluster. analytics on TB size files using mapreduce.
-
Re: Sizing helpMatt Foley 2011-11-11, 18:15
Nope; hot swap :-)
On Nov 11, 2011, at 9:59 AM, Steve Ed <[EMAIL PROTECTED]> wrote: I understand that with 0.20.204, loss of a disk doesn’t loss the node. But if we have to replace that lost disk, its again scheduling the whole node down, kicking replication *From:* Matt Foley [mailto:[EMAIL PROTECTED]] *Sent:* Friday, November 11, 2011 1:58 AM *To:* [EMAIL PROTECTED] *Subject:* Re: Sizing help I agree with Ted's argument that 3x replication is way better than 2x. But I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, if you estimate a disk failure every 2 hours, each node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is still unacceptable, so use 3x replication! :-) --Matt On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: 3x replication has two effects. One is reliability. This is probably more important in large clusters than small. Another important effect is data locality during map-reduce. Having 3x replication allows mappers to have almost all invocations read from local disk. 2x replication compromises this. Even where you don't have local data, the bandwidth available to read from 3x replicated data is 1.5x the bandwidth available for 2x replication. To get a rough feel for how reliable you should consider a cluster, you can do a pretty simple computation. If you have 12 x 2T on a single machine and you lose that machine, the remaining copies of that data must be replicated before another disk fails. With HDFS and block-level replication, the remaining copies will be spread across the entire cluster to any disk failure is reasonably like to cause data loss. For a 1000 node cluster with 12000 disks, it is conservative to estimate a disk failure on average every 2 hours. Each node will have replicate about 12GB of data which will take about 500 seconds or about 9 or 10 minutes if you only use 25% of your network for re-replication. The probability of a disk failure during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full machine failures might cause data loss. You can pick whatever you like for the rate at which nodes die, but I don't think that this is acceptable. My numbers for disk failures are purposely somewhat pessimistic. If you change the MTBF for disks to 10 years instead of 3 years, then the probability of data loss after a machine failure drops, but only to about 2.5%. Now, I would be the first to say that these numbers feel too high, but I also would rather not experience enough data loss events to have a reliable gut feel for how often they should occur. My feeling is that 2x is fine for data you can reconstruct and which you don't need to read really fast, but not good enough for data whose loss will get you fired. On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: I have been running with 2x replication on a 500tb cluster. No issues whatsoever. 3x is for super paranoid. On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: Depending on which distribution and what your data center power limits are you may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. With suitable engineering margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy with 2x replication and little space to spare then you can double that density. On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote: > For a 1PB installation you would need close to 170 servers with 12 TB disk pack installed on them (with replication factor of 2). Thats a conservative estimate > CPUs: 4 cores with 16gb of memory > > Namenode: 4 core with 32gb of memory should be ok. > > > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote: cluster. analytics on TB size files using mapreduce.
-
Re: Sizing helpTodd Lipcon 2011-11-11, 18:37
On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley <[EMAIL PROTECTED]> wrote:
> Nope; hot swap :-) AFAIK you can't re-add the marked-dead disk to the DN, can you? But yea, you can hot-swap the disk, then kick the DN process, which should take less than 10 minutes. That means the NN won't ever notice it's down, and you won't incur any replication costs. -Todd > > On Nov 11, 2011, at 9:59 AM, Steve Ed <[EMAIL PROTECTED]> wrote: > > I understand that with 0.20.204, loss of a disk doesn’t loss the node. But > if we have to replace that lost disk, its again scheduling the whole node > down, kicking replication > > > > From: Matt Foley [mailto:[EMAIL PROTECTED]] > Sent: Friday, November 11, 2011 1:58 AM > To: [EMAIL PROTECTED] > Subject: Re: Sizing help > > > > I agree with Ted's argument that 3x replication is way better than 2x. But > I do have to point out that, since 0.20.204, the loss of a disk no longer > causes the loss of a whole node (thankfully!) unless it's the system disk. > So in the example given, if you estimate a disk failure every 2 hours, each > node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72 > such failures risks data loss, rather than 1-in-12. Which is still > unacceptable, so use 3x replication! :-) > > --Matt > > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > 3x replication has two effects. One is reliability. This is probably more > important in large clusters than small. > > > > Another important effect is data locality during map-reduce. Having 3x > replication allows mappers to have almost all invocations read from local > disk. 2x replication compromises this. Even where you don't have local > data, the bandwidth available to read from 3x replicated data is 1.5x the > bandwidth available for 2x replication. > > > > To get a rough feel for how reliable you should consider a cluster, you can > do a pretty simple computation. If you have 12 x 2T on a single machine and > you lose that machine, the remaining copies of that data must be replicated > before another disk fails. With HDFS and block-level replication, the > remaining copies will be spread across the entire cluster to any disk > failure is reasonably like to cause data loss. For a 1000 node cluster with > 12000 disks, it is conservative to estimate a disk failure on average every > 2 hours. Each node will have replicate about 12GB of data which will take > about 500 seconds or about 9 or 10 minutes if you only use 25% of your > network for re-replication. The probability of a disk failure during a 10 > minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full > machine failures might cause data loss. You can pick whatever you like for > the rate at which nodes die, but I don't think that this is acceptable. > > > > My numbers for disk failures are purposely somewhat pessimistic. If you > change the MTBF for disks to 10 years instead of 3 years, then the > probability of data loss after a machine failure drops, but only to about > 2.5%. > > > > Now, I would be the first to say that these numbers feel too high, but I > also would rather not experience enough data loss events to have a reliable > gut feel for how often they should occur. > > > > My feeling is that 2x is fine for data you can reconstruct and which you > don't need to read really fast, but not good enough for data whose loss will > get you fired. > > > > On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote: > > I have been running with 2x replication on a 500tb cluster. No issues > whatsoever. 3x is for super paranoid. > > > > On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > Depending on which distribution and what your data center power limits are > you may save a lot of money by going with machines that have 12 x 2 or 3 tb > drives. With suitable engineering margins and 3 x replication you can have > 5 tb net data per node and 20 nodes per rack. If you want to go all cowboy Todd Lipcon Software Engineer, Cloudera
-
Re: Sizing helpMatt Foley 2011-11-15, 22:29
Todd is correct. The capability to recognize repaired disks and
re-incorporate them is not available in the current implementation of disk fail-in-place. So the datanode service does need to be restarted, at which point it will re-join the cluster automatically, with all its working disks. On Fri, Nov 11, 2011 at 10:37 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley <[EMAIL PROTECTED]> > wrote: > > Nope; hot swap :-) > > AFAIK you can't re-add the marked-dead disk to the DN, can you? > > But yea, you can hot-swap the disk, then kick the DN process, which > should take less than 10 minutes. That means the NN won't ever notice > it's down, and you won't incur any replication costs. > > -Todd > > > > > On Nov 11, 2011, at 9:59 AM, Steve Ed <[EMAIL PROTECTED]> wrote: > > > > I understand that with 0.20.204, loss of a disk doesn’t loss the node. > But > > if we have to replace that lost disk, its again scheduling the whole node > > down, kicking replication > > > > > > > > From: Matt Foley [mailto:[EMAIL PROTECTED]] > > Sent: Friday, November 11, 2011 1:58 AM > > To: [EMAIL PROTECTED] > > Subject: Re: Sizing help > > > > > > > > I agree with Ted's argument that 3x replication is way better than 2x. > But > > I do have to point out that, since 0.20.204, the loss of a disk no longer > > causes the loss of a whole node (thankfully!) unless it's the system > disk. > > So in the example given, if you estimate a disk failure every 2 hours, > each > > node only has to re-replicate about 2GB of data, not 12GB. So about > 1-in-72 > > such failures risks data loss, rather than 1-in-12. Which is still > > unacceptable, so use 3x replication! :-) > > > > --Matt > > > > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > > 3x replication has two effects. One is reliability. This is probably > more > > important in large clusters than small. > > > > > > > > Another important effect is data locality during map-reduce. Having 3x > > replication allows mappers to have almost all invocations read from local > > disk. 2x replication compromises this. Even where you don't have local > > data, the bandwidth available to read from 3x replicated data is 1.5x the > > bandwidth available for 2x replication. > > > > > > > > To get a rough feel for how reliable you should consider a cluster, you > can > > do a pretty simple computation. If you have 12 x 2T on a single machine > and > > you lose that machine, the remaining copies of that data must be > replicated > > before another disk fails. With HDFS and block-level replication, the > > remaining copies will be spread across the entire cluster to any disk > > failure is reasonably like to cause data loss. For a 1000 node cluster > with > > 12000 disks, it is conservative to estimate a disk failure on average > every > > 2 hours. Each node will have replicate about 12GB of data which will > take > > about 500 seconds or about 9 or 10 minutes if you only use 25% of your > > network for re-replication. The probability of a disk failure during a > 10 > > minute period is 1-exp(-10/120) = 8%. This means that roughly 1 in 12 > full > > machine failures might cause data loss. You can pick whatever you like > for > > the rate at which nodes die, but I don't think that this is acceptable. > > > > > > > > My numbers for disk failures are purposely somewhat pessimistic. If you > > change the MTBF for disks to 10 years instead of 3 years, then the > > probability of data loss after a machine failure drops, but only to about > > 2.5%. > > > > > > > > Now, I would be the first to say that these numbers feel too high, but I > > also would rather not experience enough data loss events to have a > reliable > > gut feel for how often they should occur. > > > > > > > > My feeling is that 2x is fine for data you can reconstruct and which you > > don't need to read really fast, but not good enough for data whose loss |