Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> Re: Estimating disk space requirements

Copy link to this message
Re: Estimating disk space requirements
I disagree. There are some significant advantages to using "many small nodes" instead of "few big nodes". As Ted points out, there are some disadvantages as well, so you have to look at the trade-offs. But consider:

- NUMA: If your hadoop nodes span physical NUMA nodes, then performance will suffer from remote memory accesses. The Linux scheduler tries to minimize this, but I've found that about 1/3 of memory accesses are remote on a 2-socket machine. This effect will be more severe on bigger machines. Hadoop nodes that fit on a NUMA node will have not access remote memory at all (at least on vSphere).

- Disk partitioning: Smaller nodes with fewer disks each can significantly increase average disk utilization, not decrease it. Having many threads operating against many disks in the "big node" case tends to leave some disks idle while others are over-subscribed. Partitioning disks among nodes decreases this effect. The extreme case is one disk per node, where no disks will be idle as long as there is work to do.

- Management: Not a performance effect, but smaller nodes enable easier multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware with other workloads, etc.


----- Original Message -----

From: "Ted Dunning" <[EMAIL PROTECTED]>
Sent: Friday, January 18, 2013 3:36:30 PM
Subject: Re: Estimating disk space requirements

If you make 20 individual small servers, that isn't much different from 20 from one server. The only difference would be if the neighbors of the separate VMs use less resource.
On Fri, Jan 18, 2013 at 3:34 PM, Panshul Whisper < [EMAIL PROTECTED] > wrote:

ah now i understand what you mean.
I will be creating 20 individual servers on the cloud, and not create one big server and make several virtual nodes inside it.
I will be paying for 20 different nodes.. all configured with hadoop and connected to the cluster.
Thanx for the intel :)

On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning < [EMAIL PROTECTED] > wrote:

It is usually better to not subdivide nodes into virtual nodes. You will generally get better performance form the original node because you only pay for the OS once and because your disk I/O will be scheduled better.
If you look at EC2 pricing, however, the spot market often has arbitrage opportunities where one size node is absurdly cheap relative to others. In that case, it pays to scale the individual nodes up or down.
The only reasonable reason to split nodes to very small levels is for testing and training.
On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper < [EMAIL PROTECTED] > wrote:


Thnx for the reply Ted,
You can find 40 GB disks when u make virtual nodes on a cloud like Rackspace ;-)
About the os partitions I did not exactly understand what you meant.
I have made a server on the cloud.. And I just installed and configured hadoop and hbase in the /use/local folder.
And I am pretty sure it does not have a separate partition for root.
Please help me explain what u meant and what else precautions should I take.
Ouch Whisper
On Jan 18, 2013 11:11 PM, "Ted Dunning" < [EMAIL PROTECTED] > wrote:

Where do you find 40gb disks now a days?
Normally your performance is going to be better with more space but your network may be your limiting factor for some computations. That could give you some paradoxical scaling. Hbase will rarely show this behavior.
Keep in mind you also want to allow for an os partition. Current standard practice is to reserve as much as 100 GB for that partition but in your case 10gb better:-)
Note that if you account for this, the node counts don't scale as simply. The overhead of these os partitions goes up with number of nodes.

On Jan 18, 2013, at 8:55 AM, Panshul Whisper < [EMAIL PROTECTED] > wrote:

If we look at it with performance in mind,
is it better to have 20 Nodes with 40 GB HDD
or is it better to have 10 Nodes with 80 GB HDD?
they are connected on a gigabit LAN

On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari < [EMAIL PROTECTED] > wrote:

20 nodes with 40 GB will do the work.

After that you will have to consider performances based on your access
pattern. But that's another story.


2013/1/18, Panshul Whisper < [EMAIL PROTECTED] >:
Regards, Ouch Whisper


Regards, Ouch Whisper