Rob Stewart 2012-01-31, 15:51
Hadoop has as a way to run Map tasks in multithreading mode, look for the
MultithreadedMapRunner & MultithreadedMapper.
On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart <[EMAIL PROTECTED]> wrote:
> I'm investigating the feasibility of a hybrid approach to parallel
> programming, by fusing together the concurrent Java fork/join
> libraries with Hadoop... MapReduce, a paradigm suited for scalable
> execution over distributed memory + fork/join, a paradigm for optimal
> multi-threaded shared memory execution.
> I am aware that, to a degree, Hadoop can take advantage of multiple
> core on a compute node, by setting the
> mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As
> it states in the "Hadoop: The definitive guide" book, each task runs
> in a separate JVM. So setting maximum map tasks to 3, will allow the
> possibility of 3 JVMs running on the Operating System, right?
> As an alternative to this approach, I am looking to introduce
> fork/join into map tasks, and perhaps maybe too, throttling down the
> maximum number of map tasks per node to 1. I would implement a
> RecordReader that produces map tasks for more coarse granularity, and
> that fork/join would subdivide to map task using multiple threads -
> where the output of the join would be the output of the map task.
> The motivation for this is that threads are a lot more lightweight
> than initializing JVMs, which as the Hadoop book points out, takes a
> second or so each time, unless mapred.job.resuse.jvm.num.tasks is set
> higher than 1 (the default is 1). So for example, opting for small
> granular maps for a particular job for a given input generates 1,000
> map tasks, which will mean that 1,000 JVMs will be created on the
> Hadoop cluster within the execution of the program. If I were to write
> a bespoke RecordReader to increase the granularity of each map, so
> much so that only 100 map tasks are created. The embedded fork/join
> code would further split each map into 10 threads, to evaluate within
> one JVM concurrently. I would expect this latter approach of multiple
> threads to have better performance, than the clunkier multple-JVM
> Has such a hybrid approach combining MapReduce with ForkJoin been
> investigated for feasibility, or similar studies published? Are there
> any important technical limitations that I should consider? Any
> thoughts on the proposed multi-threaded distributed-shared memory
> architecture are much appreciated from the Hadoop community!
> Rob Stewart