Steve Ed 2011-10-21, 21:40
Rita 2011-11-07, 10:58
Ted Dunning 2011-11-07, 22:06
Rita 2011-11-08, 00:34
Ted Dunning 2011-11-08, 00:53
Rita 2011-11-08, 12:32
For archival purposes, you don't need speed (mostly). That eliminates one
argument for 3x replication.
If you have RAID-5 or RAID-6 on your storage nodes, then you eliminate most
of your disk failure costs at the cluster level. This gives you something
like 2.2x replication cost.
You can also use the cluster level raid. Others will be able to say better
whether this is considered standard practice or not. Facebook wrote the
code and I believe that they use it. This can decrease replication factor
to about 1.2 x as far as space is concerned. I don't know of any failure
mode analysis for this usage, however.
On Tue, Nov 8, 2011 at 7:32 AM, Rita <[EMAIL PROTECTED]> wrote:
> Thats a good point. What is hdfs is used as an archive? We dont really use
> it for mapreduce more for archival purposes.
> On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 3x replication has two effects. One is reliability. This is probably
>> more important in large clusters than small.
>> Another important effect is data locality during map-reduce. Having 3x
>> replication allows mappers to have almost all invocations read from local
>> disk. 2x replication compromises this. Even where you don't have local
>> data, the bandwidth available to read from 3x replicated data is 1.5x the
>> bandwidth available for 2x replication.
>> To get a rough feel for how reliable you should consider a cluster, you
>> can do a pretty simple computation. If you have 12 x 2T on a single
>> machine and you lose that machine, the remaining copies of that data must
>> be replicated before another disk fails. With HDFS and block-level
>> replication, the remaining copies will be spread across the entire cluster
>> to any disk failure is reasonably like to cause data loss. For a 1000 node
>> cluster with 12000 disks, it is conservative to estimate a disk failure on
>> average every 2 hours. Each node will have replicate about 12GB of data
>> which will take about 500 seconds or about 9 or 10 minutes if you only use
>> 25% of your network for re-replication. The probability of a disk failure
>> during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly
>> 1 in 12 full machine failures might cause data loss. You can pick
>> whatever you like for the rate at which nodes die, but I don't think that
>> this is acceptable.
>> My numbers for disk failures are purposely somewhat pessimistic. If you
>> change the MTBF for disks to 10 years instead of 3 years, then the
>> probability of data loss after a machine failure drops, but only to about
>> Now, I would be the first to say that these numbers feel too high, but I
>> also would rather not experience enough data loss events to have a reliable
>> gut feel for how often they should occur.
>> My feeling is that 2x is fine for data you can reconstruct and which you
>> don't need to read really fast, but not good enough for data whose loss
>> will get you fired.
>> On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote:
>>> I have been running with 2x replication on a 500tb cluster. No issues
>>> whatsoever. 3x is for super paranoid.
>>> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]>wrote:
>>>> Depending on which distribution and what your data center power limits
>>>> are you may save a lot of money by going with machines that have 12 x 2 or
>>>> 3 tb drives. With suitable engineering margins and 3 x replication you can
>>>> have 5 tb net data per node and 20 nodes per rack. If you want to go all
>>>> cowboy with 2x replication and little space to spare then you can double
>>>> that density.
>>>> On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote:
>>>> > For a 1PB installation you would need close to 170 servers with 12 TB
>>>> disk pack installed on them (with replication factor of 2). Thats a
>>>> conservative estimate
>>>> > CPUs: 4 cores with 16gb of memory
Matt Foley 2011-11-11, 09:57
Koji Noguchi 2011-11-11, 17:26
Ted Dunning 2011-11-11, 17:49
Steve Ed 2011-11-11, 17:59
Matt Foley 2011-11-11, 18:15
Todd Lipcon 2011-11-11, 18:37
Matt Foley 2011-11-15, 22:29