Steve Ed 2011-10-21, 21:40
Rita 2011-11-07, 10:58
Ted Dunning 2011-11-07, 22:06
Rita 2011-11-08, 00:34
Ted Dunning 2011-11-08, 00:53
Rita 2011-11-08, 12:32
Ted Dunning 2011-11-08, 12:38
Matt Foley 2011-11-11, 09:57
Koji Noguchi 2011-11-11, 17:26
Ted Dunning 2011-11-11, 17:49
Steve Ed 2011-11-11, 17:59
Nope; hot swap :-)
On Nov 11, 2011, at 9:59 AM, Steve Ed <[EMAIL PROTECTED]> wrote:
I understand that with 0.20.204, loss of a disk doesn't lose the node.
But if we have to replace that lost disk, it again means scheduling the
whole node down and kicking off re-replication.
*From:* Matt Foley [mailto:[EMAIL PROTECTED]]
*Sent:* Friday, November 11, 2011 1:58 AM
*To:* [EMAIL PROTECTED]
*Subject:* Re: Sizing help
I agree with Ted's argument that 3x replication is way better than 2x. But
I do have to point out that, since 0.20.204, the loss of a disk no longer
causes the loss of a whole node (thankfully!) unless it's the system disk.
So in the example given, if you estimate a disk failure every 2 hours,
each node only has to re-replicate about 2GB of data, not 12GB. So about
1-in-72 such failures risks data loss, rather than 1-in-12. Which is still
unacceptable, so use 3x replication! :-)
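A quick sanity check of those numbers, sketched in Python (the ~10 minute
whole-node re-replication window and the 2-hour cluster-wide disk failure
interval are borrowed from Ted's mail quoted below, so the check is only
as good as those estimates):

import math

# Ted's whole-node case (quoted below): ~10 minute re-replication window,
# one disk failure somewhere in the cluster every ~120 minutes.
window_node = 10.0
window_disk = window_node / 6   # one 2TB disk is 1/6 of a 12-disk node's data
for label, window in [("whole node", window_node), ("single disk", window_disk)]:
    p = 1 - math.exp(-window / 120.0)   # another disk fails inside the window
    print("%s: %.1f%%" % (label, 100 * p))
# Prints ~8.0% (about 1 in 12) and ~1.4% (about 1 in 72).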
On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
3x replication has two effects. One is reliability. This is probably more
important in large clusters than small.
Another important effect is data locality during map-reduce. 3x
replication allows almost all mapper invocations to read from local
disk; 2x replication compromises this. Even where you don't have local
data, the bandwidth available to read 3x replicated data is 1.5x the
bandwidth available with 2x replication, simply because there are three
copies to read from instead of two.
To get a rough feel for how reliable you should consider a cluster to be,
you can do a pretty simple computation. If you have 12 x 2TB on a single
machine and you lose that machine, the remaining copies of that data must
be re-replicated before another disk fails. With HDFS and block-level
replication, the remaining copies will be spread across the entire cluster,
so any disk failure is reasonably likely to cause data loss. For a 1000 node
cluster with 12000 disks, it is conservative to estimate a disk failure on
average every 2 hours. Each node will have to replicate about 12GB of data,
which will take about 500 seconds, call it 9 or 10 minutes, if you only use
25% of your network for re-replication. The probability of a disk failure
during a 10 minute period is 1 - exp(-10/120) = 8%. This means that roughly
1 in 12 full machine failures might cause data loss. You can pick
whatever you like for the rate at which nodes die, but I don't think that
this is acceptable.
My numbers for disk failures are purposely somewhat pessimistic. If you
change the MTBF for disks to 10 years instead of 3 years, then the
probability of data loss after a machine failure drops, but only to about
2.5%, or roughly 1 in 40.
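For the curious, the whole calculation fits in a few lines of Python (a
sketch of the model described above; the 400 minute interval in the
10-year MTBF case is just the 2-hour figure scaled by 10/3):

import math

def p_loss(window_min, fail_interval_min):
    # Chance that some other disk fails while the lost replicas are
    # still being re-created, treating failures as a Poisson process.
    return 1 - math.exp(-window_min / fail_interval_min)

print(p_loss(10, 120))  # ~0.080: 3-year MTBF, a failure every ~2h -> ~1 in 12
print(p_loss(10, 400))  # ~0.025: 10-year MTBF stretches that to ~6.7h -> ~1 in 40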
Now, I would be the first to say that these numbers feel too high, but I
also would rather not experience enough data loss events to have a reliable
gut feel for how often they should occur.
My feeling is that 2x is fine for data you can reconstruct and which you
don't need to read really fast, but not good enough for data whose loss
will get you fired.
On Mon, Nov 7, 2011 at 7:34 PM, Rita <[EMAIL PROTECTED]> wrote:
I have been running with 2x replication on a 500tb cluster. No issues
whatsoever. 3x is for the super paranoid.
On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
Depending on which distribution and what your data center power limits are,
you may save a lot of money by going with machines that have 12 x 2 or 3 TB
drives. With suitable engineering margins and 3x replication you can have
5 TB net data per node and 20 nodes per rack. If you want to go all cowboy
with 2x replication and little space to spare, then you can double that.
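In numbers, that back-of-envelope looks roughly like this (a sketch; the
~37% headroom is my guess at what "suitable engineering margins" means):

# Assumed node: 12 x 2TB drives, 3x replication, ~37% headroom for
# temp/shuffle space and growth (the headroom figure is a guess).
raw_tb = 12 * 2.0            # 24 TB raw per node
logical_tb = raw_tb / 3      # 8 TB after 3x replication
net_tb = logical_tb * 0.625  # ~5 TB net data per node
print(net_tb, net_tb * 20)   # 5.0 TB/node, 100 TB net per 20-node rack
# Going "cowboy" with 2x replication and little headroom roughly doubles this.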
On Monday, November 7, 2011, Rita <[EMAIL PROTECTED]> wrote:
> For a 1PB installation you would need close to 170 servers with a 12 TB
> disk pack installed on them (with a replication factor of 2). That's a
> CPUs: 4 cores with 16gb of memory
> Namenode: 4 cores with 32gb of memory should be ok.
> On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <[EMAIL PROTECTED]> wrote:
analytics on TB size files using mapreduce.
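For what it's worth, the 170-server figure above checks out; a one-liner,
assuming 12 TB of usable disk per server as stated:

import math
# 1 PB of data at 2x replication, 12 TB usable per server (as stated above).
servers = math.ceil(1000 * 2 / 12)
print(servers)  # 167 -> "close to 170"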
Todd Lipcon 2011-11-11, 18:37
Matt Foley 2011-11-15, 22:29