Hadoop >> mail # user >> Re: Estimating disk space requirements


Re: Estimating disk space requirements
Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <[EMAIL PROTECTED]> wrote:

> I disagree.  There are some significant advantages to using "many small
> nodes" instead of "few big nodes".  As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs.  But consider:
>
> - NUMA:  If your Hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses.  The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine.  This effect will be more severe on bigger
> machines.  Hadoop nodes that fit on a NUMA node will not access remote
> memory at all (at least on vSphere).
>

This is definitely a good point with respect to untainted Hadoop, but a
system like MapR does a significant amount of core-locality work to
minimize NUMA-remote fetches.  This can have a significant impact, of
course.
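
To put the 1/3 figure in perspective, here is a rough back-of-the-envelope sketch of average memory latency as a weighted mix of local and remote accesses. Only the 1/3 remote fraction comes from the thread; the latency values and the function name are hypothetical placeholders, not measurements of any real machine.

```python
def effective_latency_ns(remote_fraction, local_ns, remote_ns):
    """Average latency when a given fraction of accesses goes to remote memory."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Assumed latencies, for illustration only (not from the thread).
LOCAL_NS = 100.0
REMOTE_NS = 150.0

# A node spanning two sockets, with about 1/3 of accesses remote (figure from the thread).
spanning = effective_latency_ns(1.0 / 3.0, LOCAL_NS, REMOTE_NS)

# A node confined to one NUMA node, with no remote accesses.
confined = effective_latency_ns(0.0, LOCAL_NS, REMOTE_NS)

print(f"spanning: {spanning:.1f} ns, confined: {confined:.1f} ns")
```

The real penalty depends on the hardware's actual local and remote latencies and on how much of the workload is bound by memory access, so treat this only as a way to reason about the trade-off.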

> - Disk partitioning:  Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it.  Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.
>

Again, this is an implementation side-effect.  Good I/O scheduling and
proper striping can mitigate this substantially.

Going the other way, splitting disks between different VMs can be
disastrous.

>  Partitioning disks among nodes decreases this effect.  The extreme case
> is one disk per node, where no disks will be idle as long as there is work
> to do.
>

Yes.  Even deficient implementations should succeed in this case.
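
As a rough illustration of the idle-disk effect discussed above, the sketch below uses a deliberately naive model: each I/O thread picks one disk uniformly at random, with no scheduling or striping. That model and the function name are assumptions made for illustration; they do not describe how Hadoop or any other system actually places I/O.

```python
def expected_idle_fraction(disks, threads):
    """Expected fraction of disks that no thread picked, under uniform random choice."""
    if disks < 1 or threads < 0:
        raise ValueError("need at least one disk and a non-negative thread count")
    # One disk is missed by one thread with probability (1 - 1/disks);
    # the threads choose independently, so the probabilities multiply.
    return (1.0 - 1.0 / disks) ** threads


# "Big node": 12 disks shared by 12 threads (hypothetical sizes).
big_node = expected_idle_fraction(disks=12, threads=12)

# Extreme case from the thread: one disk per node, with one thread working on it.
one_disk_per_node = expected_idle_fraction(disks=1, threads=1)

print(f"big node idle fraction: {big_node:.2f}")
print(f"one disk per node idle fraction: {one_disk_per_node:.2f}")
```

Under this model roughly a third of the disks in the big node sit idle at any instant, while the one-disk-per-node case has none idle as long as there is work. Good I/O scheduling and striping are exactly what would break the random-choice assumption and narrow that gap.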

You do lose the ability to allow big-memory jobs that would otherwise span
multiple slots.

> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.
>

Definitely true.