Re: Estimating disk space requirements
Ted Dunning 2013-01-19, 03:39
Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <[EMAIL PROTECTED]> wrote:

> I disagree.  There are some significant advantages to using "many small
> nodes" instead of "few big nodes".  As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs.  But consider:
>
> - NUMA:  If your hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses.  The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine.  This effect will be more severe on bigger
> machines.  Hadoop nodes that fit on a NUMA node will not access remote
> memory at all (at least on vSphere).
>

This is definitely a good point with respect to stock Hadoop, but a system
like MapR does a significant amount of core-locality work to minimize
NUMA-remote fetches.  That can have a significant impact, of course.
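
If you want to sanity-check that remote-access figure on your own workers,
Linux exposes per-node allocation counters under /sys.  A rough, illustrative
sketch (not from this thread; note the counters track page placement, which
is only a proxy for access locality):

#!/usr/bin/env python
"""Rough NUMA-locality check for a worker node (illustrative sketch only)."""
import glob

local = remote = 0
for path in glob.glob("/sys/devices/system/node/node*/numastat"):
    with open(path) as f:
        counters = dict(line.split() for line in f)
    local += int(counters["local_node"])   # pages allocated on the requesting node
    remote += int(counters["other_node"])  # pages allocated on a different node

total = local + remote
if total:
    print("remote allocations: %.1f%% (%d of %d pages)"
          % (100.0 * remote / total, remote, total))
else:
    print("no NUMA counters found (non-NUMA kernel or single node?)")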

> - Disk partitioning:  Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it.  Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.
>

Again, this is an implementation side-effect.  Good I/O scheduling and
proper striping can mitigate this substantially.

Going the other way, splitting disks between different VMs can be disastrous.
>  Partitioning disks among nodes decreases this effect.  The extreme case
> is one disk per node, where no disks will be idle as long as there is work
> to do.
>

Yes.  Even deficient implementations should succeed in this case.
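
To make the idle-disk effect concrete, here is a toy simulation (my own
illustrative numbers, not measurements from anyone's cluster): uncoordinated
tasks that each pick a disk at random leave roughly a third of a 12-disk JBOD
idle at any instant, while one disk per node is never idle as long as there
is work.

#!/usr/bin/env python
"""Toy model of the idle-disk effect described above (illustrative only)."""
import random

def idle_fraction(disks, requests, rounds=10000):
    # Each round, `requests` concurrent reads each pick one of `disks` at random.
    idle = 0
    for _ in range(rounds):
        hit = set(random.randrange(disks) for _ in range(requests))
        idle += disks - len(hit)
    return float(idle) / (disks * rounds)

# 12 disks behind one big node, 12 concurrent tasks: ~35% of disks idle per round.
print("big node   :", round(idle_fraction(12, 12), 2))
# 12 one-disk nodes, one task each: no disk is ever idle.
print("small nodes:", round(idle_fraction(1, 1), 2))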

You do lose the ability to run big-memory jobs that would otherwise span
multiple slots.
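
Back-of-envelope arithmetic makes that trade-off plain (hypothetical numbers,
purely illustrative):

# Hypothetical sizing only, not from the thread.
total_ram_gb = 96        # RAM per physical machine
os_overhead_gb = 2       # reserve for OS and Hadoop daemons per node

for nodes in (1, 8):     # one big node vs. eight small ones
    per_node = total_ram_gb / nodes
    max_task_heap = per_node - os_overhead_gb
    print("%d node(s) of %d GB -> largest single-task heap ~%d GB"
          % (nodes, per_node, max_task_heap))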
> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.
>

Definitely true.