Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hadoop >> mail # user >> Hybrid Hadoop with fork/join ?


+
Rob Stewart 2012-01-31, 15:51
Copy link to this message
-
Re: Hybrid Hadoop with fork/join ?
Rob,

Hadoop has as a way to run Map tasks in multithreading mode, look for the
MultithreadedMapRunner & MultithreadedMapper.

Thanks.

Alejandro.

On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm investigating the feasibility of a hybrid approach to parallel
> programming, by fusing together the concurrent Java fork/join
> libraries with Hadoop... MapReduce, a paradigm suited for scalable
> execution over distributed memory + fork/join, a paradigm for optimal
> multi-threaded shared memory execution.
>
> I am aware that, to a degree, Hadoop can take advantage of multiple
> core on a compute node, by setting the
> mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As
> it states in the "Hadoop: The definitive guide" book, each task runs
> in a separate JVM. So setting maximum map tasks to 3, will allow the
> possibility of 3 JVMs running on the Operating System, right?
>
> As an alternative to this approach, I am looking to introduce
> fork/join into map tasks, and perhaps maybe too, throttling down the
> maximum number of map tasks per node to 1. I would implement a
> RecordReader that produces map tasks for more coarse granularity, and
> that fork/join would subdivide to map task using multiple threads -
> where the output of the join would be the output of the map task.
>
> The motivation for this is that threads are a lot more lightweight
> than initializing JVMs, which as the Hadoop book points out, takes a
> second or so each time, unless mapred.job.resuse.jvm.num.tasks is set
> higher than 1 (the default is 1). So for example, opting for small
> granular maps for a particular job for a given input generates 1,000
> map tasks, which will mean that 1,000 JVMs will be created on the
> Hadoop cluster within the execution of the program. If I were to write
> a bespoke RecordReader to increase the granularity of each map, so
> much so that only 100 map tasks are created. The embedded fork/join
> code would further split each map into 10 threads, to evaluate within
> one JVM concurrently. I would expect this latter approach of multiple
> threads to have better performance, than the clunkier multple-JVM
> approach.
>
> Has such a hybrid approach combining MapReduce with ForkJoin been
> investigated for feasibility, or similar studies published? Are there
> any important technical limitations that I should consider? Any
> thoughts on the proposed multi-threaded distributed-shared memory
> architecture are much appreciated from the Hadoop community!
>
> --
> Rob Stewart
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB