Also, you may have to adjust your algorithms.
For instance, the standard algorithm for SVD is a Lanczos iterative
algorithm. Iteration in Hadoop is death because of job invocation time.
What you wind up with is an algorithm that will handle big data, but with
a slow-down factor so large that a single node performs at the same level
as 100 Hadoop nodes or more. How well an iterative algorithm like this
scales is irrelevant because of the enormous fixed cost of each job.
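To make the fixed-cost argument concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function names and every number in it (30 seconds of job launch overhead, 30 seconds of single-node work per pass, 100 passes) are made-up assumptions for illustration, not measurements.

```python
def cluster_time(per_pass_work_s, passes, nodes, job_overhead_s):
    """Estimated wall-clock seconds when every pass is a separate cluster job.

    Each pass pays the fixed job launch overhead, then splits its work
    evenly across the nodes (an idealized, skew-free assumption).
    """
    return passes * (job_overhead_s + per_pass_work_s / nodes)


def single_node_time(per_pass_work_s, passes):
    """Estimated wall-clock seconds for an in-memory loop on one machine.

    Assumes no per-pass launch cost at all.
    """
    return passes * per_pass_work_s


# Hypothetical figures, chosen only to show the shape of the problem.
on_cluster = cluster_time(per_pass_work_s=30.0, passes=100, nodes=100, job_overhead_s=30.0)
on_one_box = single_node_time(per_pass_work_s=30.0, passes=100)
print(f"100 nodes: {on_cluster:.0f} s, 1 node: {on_one_box:.0f} s")
```

Under these assumed numbers the launch overhead dominates, so adding nodes barely moves the total and the single machine keeps pace with the cluster.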
On the other hand, you can switch to some of the recently developed
stochastic projection algorithms, which are non-iterative and require
only 4-7 map-reduce steps (depending on which outputs you need). With
these projection algorithms, Hadoop can out-run other techniques even with
quite modest cluster sizes, and it will scale linearly.
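For anyone who has not seen the projection idea, here is a minimal single-machine sketch in NumPy. It assumes the basic randomized range-finder scheme (random projection, QR, then a small dense SVD). The function name and the oversample parameter are my own choices for illustration, and this is not the map-reduce implementation itself. The point is only that the number of passes over A is fixed and small, rather than growing with an iteration count.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    p = min(n, k + oversample)  # a few extra columns improve accuracy

    # Pass 1 over A: project onto a random p-dimensional subspace.
    omega = rng.standard_normal((n, p))
    Y = A @ omega

    # Orthonormalize the projection; the columns of Q approximate the range of A.
    Q, _ = np.linalg.qr(Y)

    # Pass 2 over A: reduce A to a small (at most p x n) matrix.
    B = Q.T @ A

    # B is small enough for an ordinary dense SVD.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Map the left singular vectors back to the original row space.
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]


# Usage on a synthetic matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(A, k=5)
print(np.allclose((U * s) @ Vt, A))
```

Each product with A is a single sweep over the data, which is the kind of work that fits a map-reduce step. A fixed number of such sweeps means the job launch cost is paid a handful of times instead of once per Lanczos iteration.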
On Thu, Jan 17, 2013 at 9:47 PM, Stephen Boesch <[EMAIL PROTECTED]> wrote:
> Hi Thiago,
> Subjectively: there are a number of items to consider to achieve nearly
> linear scaling:
> - The work must be well balanced among the tasks, with no skew.
> - No skew in the association of tasks to nodes. Note: this skew
> actually happens by default if the number of tasks is less than the cluster
> capacity of slots. You will notice that on a cluster with 20 nodes, with
> each node set to 20 mapper tasks, if you launch a job with 20 maps it may
> well have all of them running on one node.
> - With a higher number of tasks, the risk of stragglers hurting overall
> throughput/performance increases unless speculative execution is
> configured properly.
> - hadoop configuration settings come under more pressure with more
> - properly tuning the number of mappers and reducers to (a) your node
> and cluster characteristics and (b) the particular tasks has a large impact
> on performance. In my experience the settings are often set too
> conservatively / too low to take advantage of the node and cluster
> So in summary hadoop itself is capable of nearly linear scaling to low
> thousands of nodes, but configuring the cluster to really achieve that
> requires effort.
> 2013/1/17 Thiago Vieira <[EMAIL PROTECTED]>
>> It is common to see this sentence: "Hadoop Scales Linearly". But is there
>> any performance evaluation to confirm this?
>> In my evaluations, Hadoop processing capacity scales linearly, but not in
>> proportion to the number of nodes: the processing capacity achieved with 20
>> nodes is not double the processing capacity achieved with 10 nodes.
>> Is there any evaluation about this?
>> Thank you!
>> Thiago Vieira