
HDFS, mail # user - Sizing help


Re: Sizing help
Matt Foley 2011-11-15, 22:29
Todd is correct.  The capability to recognize repaired disks and
re-incorporate them is not available in the current implementation of disk
fail-in-place.  So the datanode service does need to be restarted, at which
point it will re-join the cluster automatically, with all its working disks.

On Fri, Nov 11, 2011 at 10:37 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley <[EMAIL PROTECTED]> wrote:
> > Nope; hot swap :-)
>
> AFAIK you can't re-add the marked-dead disk to the DN, can you?
>
> But yea, you can hot-swap the disk, then kick the DN process, which
> should take less than 10 minutes. That means the NN won't ever notice
> it's down, and you won't incur any replication costs.
>
> -Todd
>
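Todd's "less than 10 minutes" lines up with the NameNode's dead-node
timeout.  A minimal sketch of that window, assuming the stock 0.20-era
defaults (heartbeat.recheck.interval of 5 minutes, dfs.heartbeat.interval of
3 seconds) and the usual expiry formula of 2 * recheck + 10 * heartbeat;
check your own hdfs-site.xml before relying on it:

    # Rough check of the window Todd is relying on: how long a DataNode can
    # be down before the NameNode declares it dead and starts re-replication.
    # The values below are assumed defaults, not read from any cluster.

    heartbeat_recheck_s = 5 * 60   # heartbeat.recheck.interval (assumed 5 min)
    heartbeat_interval_s = 3       # dfs.heartbeat.interval (assumed 3 s)

    dead_node_timeout_s = 2 * heartbeat_recheck_s + 10 * heartbeat_interval_s
    print(f"DataNode declared dead after ~{dead_node_timeout_s / 60:.1f} min")
    # ~10.5 minutes -- a hot swap plus DataNode restart that finishes inside
    # ~10 minutes never triggers re-replication of that node's blocks.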
> >
> > On Nov 11, 2011, at 9:59 AM, Steve Ed <[EMAIL PROTECTED]> wrote:
> >
> > I understand that with 0.20.204, loss of a disk doesn't lose the node.
> > But if we have to replace that lost disk, it again means scheduling the
> > whole node down, kicking off replication.
> >
> >
> >
> > From: Matt Foley [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, November 11, 2011 1:58 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Sizing help
> >
> >
> >
> > I agree with Ted's argument that 3x replication is way better than 2x.
> > But I do have to point out that, since 0.20.204, the loss of a disk no
> > longer causes the loss of a whole node (thankfully!) unless it's the
> > system disk.  So in the example given, if you estimate a disk failure
> > every 2 hours, each node only has to re-replicate about 2GB of data, not
> > 12GB.  So about 1-in-72 such failures risks data loss, rather than
> > 1-in-12.  Which is still unacceptable, so use 3x replication! :-)
> >
> > --Matt
> >
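Matt's 1-in-72 appears to come from scaling Ted's whole-node estimate
(quoted further down) linearly by how much data each surviving node has to
re-replicate.  A quick sketch of that scaling, taking the ~8% / 12GB
whole-node figures and the ~2GB single-disk figure at face value:

    # Rough scaling behind "1-in-72": for small probabilities, the risk of a
    # second disk failure is roughly proportional to the re-replication
    # window, which is proportional to the data each surviving node copies.

    p_whole_node = 0.08      # Ted's ~8% (about 1-in-12) for a whole-node loss
    gb_whole_node = 12.0     # ~12 GB re-replicated per survivor, whole node lost
    gb_single_disk = 2.0     # Matt's estimate when only a single disk is lost

    p_single_disk = p_whole_node * gb_single_disk / gb_whole_node
    print(f"~{p_single_disk:.1%}, i.e. about 1 in {1 / p_single_disk:.0f}")
    # ~1.3%, about 1 in 75 -- the same ballpark as Matt's 1-in-72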
> > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > 3x replication has two effects.  One is reliability.  This is probably
> > more important in large clusters than small.
> >
> >
> >
> > Another important effect is data locality during map-reduce.  Having 3x
> > replication allows mappers to have almost all invocations read from local
> > disk.  2x replication compromises this.  Even where you don't have local
> > data, the bandwidth available to read from 3x replicated data is 1.5x the
> > bandwidth available for 2x replication.
> >
> >
> >
> > To get a rough feel for how reliable you should consider a cluster, you
> > can do a pretty simple computation.  If you have 12 x 2T on a single
> > machine and you lose that machine, the remaining copies of that data
> > must be replicated before another disk fails.  With HDFS and block-level
> > replication, the remaining copies will be spread across the entire
> > cluster, so any disk failure is reasonably likely to cause data loss.
> > For a 1000 node cluster with 12000 disks, it is conservative to estimate
> > a disk failure on average every 2 hours.  Each node will have to
> > replicate about 12GB of data, which will take about 500 seconds, or
> > about 9 or 10 minutes, if you only use 25% of your network for
> > re-replication.  The probability of a disk failure during a 10 minute
> > period is 1-exp(-10/120) = 8%.  This means that roughly 1 in 12 full
> > machine failures might cause data loss.  You can pick whatever you like
> > for the rate at which nodes die, but I don't think that this is
> > acceptable.
> >
> >
> >
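Ted's arithmetic is easy to reproduce.  The sketch below assumes a 1 Gbps
NIC per node (not stated in the thread) to recover the re-replication time,
and ~12 TB of live data on the lost machine, which is what the
12GB-per-survivor figure implies; everything else is taken from the message:

    import math

    # Back-of-the-envelope check of the data-loss estimate above.
    # Assumptions not stated in the thread: 1 Gbps NIC per node, ~12 TB of
    # live data on the failed machine (implied by the 12 GB-per-node figure).

    nodes = 1000
    data_on_lost_node_gb = 12 * 1000                      # ~12 TB to re-replicate
    gb_per_survivor = data_on_lost_node_gb / (nodes - 1)  # ~12 GB each

    nic_gbps = 1.0          # assumed link speed
    usable_fraction = 0.25  # 25% of the network for re-replication
    gb_per_s = nic_gbps * usable_fraction / 8

    window_s = gb_per_survivor / gb_per_s
    print(f"re-replication window ~{window_s:.0f} s")
    # ~385 s under these assumptions; Ted quotes ~500 s and rounds to ~10 min.

    # One disk failure on average every 2 hours across the 12,000 disks:
    window_min = 10
    mean_minutes_between_failures = 120
    p_loss = 1 - math.exp(-window_min / mean_minutes_between_failures)
    print(f"P(another disk fails inside the window) ~ {p_loss:.1%}")
    # ~8%, i.e. roughly 1 machine failure in 12 risks losing some blocks.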
> > My numbers for disk failures are purposely somewhat pessimistic.  If you
> > change the MTBF for disks to 10 years instead of 3 years, then the
> > probability of data loss after a machine failure drops, but only to about
> > 2.5%.
> >
> >
> >
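The step from a per-disk MTBF to "a disk failure on average every 2 hours",
and the sensitivity to that MTBF, can be checked the same way.  A short
sketch using the 12,000-disk fleet and the ~10 minute window from above; the
3-year and 10-year MTBF values are the ones Ted names:

    import math

    # How a per-disk MTBF maps to a fleet-wide failure interval, and how the
    # data-loss probability responds.  Same 12,000 disks and ~10 minute
    # re-replication window as in the example above.

    disks = 12000
    window_min = 10.0

    for mtbf_years in (3, 10):
        interval_min = mtbf_years * 365 * 24 * 60 / disks
        p_loss = 1 - math.exp(-window_min / interval_min)
        print(f"MTBF {mtbf_years:>2} yr: one disk failure every "
              f"~{interval_min / 60:.1f} h, P(loss) ~ {p_loss:.1%}")

    # 3-year MTBF  -> a failure about every 2.2 h, P ~ 7.3%
    # 10-year MTBF -> a failure about every 7.3 h, P ~ 2.3% (Ted's ~2.5%)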
> > Now, I would be the first to say that these numbers feel too high, but I
> > also would rather not experience enough data loss events to have a
> > reliable gut feel for how often they should occur.
> >
> >
> >
> > My feeling is that 2x is fine for data you can reconstruct and which you
> > don't need to read really fast, but not good enough for data whose loss