Re: questions about hadoop map reduce and compute intensive related applications
Ted Dunning 2011-04-30, 20:30
On Sat, Apr 30, 2011 at 12:18 AM, elton sky <[EMAIL PROTECTED]> wrote:
> I got 2 questions:
> 1. I am wondering how hadoop MR performs when it runs compute intensive
> applications, e.g. Monte carlo method compute PI. There's an example in
> QuasiMonteCarlo, but that example doesn't use random number and it
> pseudo input upfront. If we use distributed random number generation, then I
> guess the performance of hadoop should be similar with some message passing
> framework, like MPI. So my guess is by using proper method hadoop would be
> good in compute intensive applications compared with MPI.
Not quite sure what algorithms you mean here, but for trivial parallelism,
map-reduce is a fine way to go.
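To make the trivially parallel case concrete, here is a minimal sketch in
plain Python (not Hadoop code) of a Monte Carlo estimate of pi arranged as
map and reduce steps. The names map_task, reduce_counts and estimate_pi and
the per-task seeding scheme are assumptions made for illustration only. Each
map task draws its own random points and the reduce step only adds up counts,
so the tasks never have to talk to each other.

```python
import random


def map_task(task_id, samples, base_seed=12345):
    """One map task: draw points with a task-local RNG, count quarter-circle hits."""
    # Hypothetical seeding scheme: derive a distinct seed from the task id.
    rng = random.Random(base_seed * 1_000_003 + task_id)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_counts(partials):
    """Reduce step: combine (inside, total) pairs into one estimate of pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_tasks=8, samples_per_task=50_000):
    # The tasks are independent, so a framework could run them on any node.
    partials = [map_task(t, samples_per_task) for t in range(num_tasks)]
    return reduce_counts(partials)


if __name__ == "__main__":
    print(estimate_pi())
```

The only data that moves is one pair of integers per task, which is why this
kind of job fits map-reduce well.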
MPI supports node-to-node communication in ways that map-reduce does not,
however, so many algorithms have to be run as repeated map-reduce
steps. With Hadoop's current implementation, this is horrendously
slow (minimum 20-30 seconds per iteration).
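As a rough back-of-the-envelope illustration of what that per-iteration floor
costs, the sketch below multiplies it out. The figure of 100 iterations is a
hypothetical count chosen for the example, and startup_overhead_seconds is
just an illustrative helper, not anything from Hadoop.

```python
def startup_overhead_seconds(iterations, floor_seconds):
    """Lower bound on wall-clock time spent on the fixed per-iteration cost."""
    return iterations * floor_seconds


# Hypothetical algorithm needing 100 map-reduce iterations, at the
# 20-30 second minimum per iteration quoted above.
low = startup_overhead_seconds(100, 20)
high = startup_overhead_seconds(100, 30)
print(f"{low / 60:.0f} to {high / 60:.0f} minutes of fixed overhead")
```

That time is spent before any useful computation is counted.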
Sometimes you can avoid this by clever tricks. For instance, random
projection can compute the key step in an SVD with one
map-reduce while the comparable Lanczos algorithm requires more than one
step per eigenvector (and we often want 100 of them!).
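Here is a small sketch in plain Python of why a random projection fits in a
single pass. It assumes the key step means forming Y = A * Omega for a random
matrix Omega and accumulating the small k x k matrix Y'Y; that reading, the
function names and the shared-seed trick are all assumptions for
illustration, and the later work on the small matrix is not shown. Each
mapper sees one row of A and needs nothing from any other row.

```python
import random


def projection_matrix(n_cols, k, seed=7):
    """Omega (n_cols x k) of Gaussian entries, rebuilt by every mapper from a shared seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_cols)]


def map_row(row, omega):
    """Mapper: project one row of A (y = row * Omega) and emit the k x k outer product of y."""
    k = len(omega[0])
    y = [sum(row[j] * omega[j][c] for j in range(len(row))) for c in range(k)]
    return [[y[a] * y[b] for b in range(k)] for a in range(k)]


def reduce_sum(partials):
    """Reducer: add the small k x k matrices, giving Y'Y for the whole input."""
    k = len(partials[0])
    total = [[0.0] * k for _ in range(k)]
    for m in partials:
        for a in range(k):
            for b in range(k):
                total[a][b] += m[a][b]
    return total


def gram_of_projection(rows, k, seed=7):
    # One map over the rows, one reduce: no iteration between steps.
    omega = projection_matrix(len(rows[0]), k, seed)
    return reduce_sum([map_row(r, omega) for r in rows])


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(gram_of_projection(a, k=2))
```

The output size depends only on k, not on the number of rows, so the
reducer's result is small enough to finish on a single machine.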
Sometimes, however, there are no known algorithms that avoid the need for
repeated communication. For these problems, Hadoop as it stands may be a
poor fit. Help is on the way, however, with the MapReduce 2.0 work because
that will allow much more flexible models of computation.
> 2. I am looking for some applications, which has large data sets and
> requires intensive computation. An application can be divided into a
> workflow, including either map reduce operations, and message passing like
> operations. For example, in step 1 I use hadoop MR processes 10TB of data
> and generates small output, say, 10GB. This 10GB can be fit into memory and
> they are better be processed with some interprocess communication, which
> will boost the performance. So in step 2 I will use MPI, etc.
Some machine learning algorithms require features that are much smaller than
the original input. This leads to exactly the pattern you describe.
Integrating MPI with map-reduce is currently difficult and/or very ugly,
however. Not impossible and there are hackish ways to do the job, but they