Subjectively: there are a number of items to consider to achieve nearly
linear scaling:
- Whether the work is well balanced among the tasks, i.e. no skew.
- No skew in the assignment of tasks to nodes. Note: this skew actually
happens by default if the number of tasks is less than the cluster's slot
capacity. On a cluster with 20 nodes, each configured for 20 mapper tasks,
a job launched with 20 maps may well have all of them running on one
node.
- With a higher number of tasks, the risk of stragglers affecting
overall throughput/performance increases unless speculative execution is
enabled.
- Hadoop configuration settings come under more pressure with more
- Properly tuning the number of mappers and reducers to (a) your node
and cluster characteristics and (b) the particular tasks has a large impact
on performance. In my experience the settings are often set too
conservatively / too low to take advantage of the node and cluster
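
As a rough illustration of the slot arithmetic in the 20-node example above,
and of how one might put a number on "nearly linear", here is a minimal
Python sketch. The function names and the throughput figures (100 and 170)
are made up for illustration; they are not measurements and not part of any
Hadoop API. The "packed" case models the worst case described above, where
all slots on one node are filled before the next node is used.

```python
import math


def slot_utilization(tasks, nodes, slots_per_node):
    """Fraction of the cluster's task slots a job can occupy at once."""
    capacity = nodes * slots_per_node
    return min(tasks, capacity) / capacity


def nodes_used(tasks, nodes, slots_per_node, packed):
    """Number of nodes that receive at least one task.

    packed=True  : fill every slot on a node before using the next node
                   (assumed worst case, as in the 20-maps-on-one-node example).
    packed=False : spread tasks one per node first (assumed best case).
    """
    if packed:
        return min(nodes, math.ceil(tasks / slots_per_node))
    return min(nodes, tasks)


def scaling_efficiency(throughput_small, nodes_small,
                       throughput_large, nodes_large):
    """Observed speedup divided by ideal (proportional) speedup.

    1.0 means perfectly proportional scaling; lower values mean the
    larger cluster delivers less than its share.
    """
    speedup = throughput_large / throughput_small
    ideal = nodes_large / nodes_small
    return speedup / ideal


if __name__ == "__main__":
    # 20 nodes x 20 map slots, job with only 20 maps.
    print(f"slot utilization: {slot_utilization(20, 20, 20):.2f}")
    print("nodes used, packed:", nodes_used(20, 20, 20, packed=True))
    print("nodes used, spread:", nodes_used(20, 20, 20, packed=False))

    # Hypothetical throughputs (e.g. MB/s) for 10 and 20 nodes.
    eff = scaling_efficiency(100.0, 10, 170.0, 20)
    print(f"scaling efficiency 10 -> 20 nodes: {eff:.2f}")
```

With those inputs the job occupies 5% of the slots, and it touches either 1
node or all 20 depending on placement. A result below 1.0 from
scaling_efficiency corresponds to the situation in the quoted question below,
where 20 nodes deliver less than double the capacity of 10.
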
So in summary, Hadoop itself is capable of nearly linear scaling to the low
thousands of nodes, but configuring the cluster to really achieve that
2013/1/17 Thiago Vieira <[EMAIL PROTECTED]>
> It is common to see this sentence: "Hadoop Scales Linearly". But is there
> any performance evaluation to confirm this?
> In my evaluations, Hadoop processing capacity scales linearly, but not
> in proportion to the number of nodes: the processing capacity achieved
> with 20 nodes is not double the processing capacity achieved with 10 nodes.
> Is there any evaluation about this?
> Thank you!
> Thiago Vieira