Ted Dunning 2011-08-10, 19:44
Yes, I do have some data to back this up, but I think I mentioned that this
was just a back-of-the-envelope computation. As such, it necessarily
ignores a number of factors.
Can you say what specifically it is that you object to? Is the analysis
pessimistic or optimistic? Are you seeing lots of correlated failures? I
presume that your 40,000+ nodes are not in a single cluster and thus have
different failure modes than I was talking about. Perhaps you could say
more about your situation.
In many installations, the duty factor is low enough that the average failure
rate can be an order of magnitude lower than what I quoted. Even so, I don't
feel comfortable using that kind of rate for a computation of this sort.
On Wed, Aug 10, 2011 at 12:19 PM, Luke Lu <[EMAIL PROTECTED]> wrote:
> On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > To be specific, taking a 100 node x 10 disk x 2 TB configuration with
> > MTBF of 1000 days, we should be seeing drive failures on average once per
> > day....
> > For a 10,000 node cluster, however, we should expect an average disk
> > failure rate of one failure every 2.5 hours.
> Do you have real data to back the analysis? You assume a uniform disk
> failure distribution, which is absolutely not true. I can only say
> that our ops data across 40000+ nodes shows that the above analysis is
> not even close. (This is assuming that the ops know what they are
> doing though :)
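
As a rough illustration of the back-of-the-envelope computation quoted above, here is a minimal Python sketch. It assumes every drive fails independently at a constant rate of 1/MTBF, which is exactly the uniformity assumption questioned in the thread, and the function name `hours_between_failures` is hypothetical rather than anything from the original messages. Using only the numbers stated in the quote, 10,000 nodes with 10 disks each would come out closer to a quarter of an hour between failures than 2.5 hours, so the quoted figure may rest on details in the elided part of the message.

```python
def hours_between_failures(nodes, disks_per_node, mtbf_days):
    """Expected hours between drive failures across a whole cluster.

    Assumption (not from real ops data): drives fail independently at a
    constant rate of 1/mtbf_days each, so cluster-wide failures per day
    are simply total_disks / mtbf_days.
    """
    total_disks = nodes * disks_per_node
    failures_per_day = total_disks / mtbf_days
    return 24.0 / failures_per_day


if __name__ == "__main__":
    # 100 nodes x 10 disks, MTBF of 1000 days: about one failure per day.
    print(f"{hours_between_failures(100, 10, 1000):.2f} hours")    # 24.00 hours
    # 10,000 nodes x 10 disks under the same assumptions.
    print(f"{hours_between_failures(10_000, 10, 1000):.2f} hours")  # 0.24 hours
```

A lower duty factor or correlated failures would change the effective per-drive rate, which this sketch deliberately ignores.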