Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> Combining MultithreadedMapper threadpool size & map.tasks.maximum


Copy link to this message
-
Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Hi Rob
       I'm the culprit who posted the blog. :) The topic was of my interest as well and  I found the conversation informative and useful. Just thought of documenting the same as it could be useful for others as well in future. Hope you don't mind!..

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Rob Stewart <[EMAIL PROTECTED]>
Date: Fri, 10 Feb 2012 18:30:53
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Harsh...

Oddly, this blog post has appeared within the last hour or so....

http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html

--
Rob

On 10 February 2012 14:20, Harsh J <[EMAIL PROTECTED]> wrote:
> Hello again,
>
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <[EMAIL PROTECTED]> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
>
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when thats what slots at the TT are for?
> The cost of a single map task failure with your mammoth task approach
> would also be higher - more work to repeat.
>
>> Are you saying here that 4 single-threaded OS processes can achieve a
>> higher rate of OS IO, than 4 threads within one OS process doing IO
>> (which would sound sensible if that's the case).
>
> Yeah thats what I meant, but with the earlier point of "In
> MultithreadedMapper, the IO work is still single threaded"
> specifically in mind.
>
>> The argument against this approach is that the cost starting up OS
>> processes is far more expensive that forking threads within processes.
>> So I would have said the contrary - where map tasks are small and
>> input size is large, than many JVMs would be instantiated throughout
>> the system, one per task. Instead, one might speculate that reducing
>> the number of JVMs, replacing with lower latency thread forking would
>> improve runtime speeds. ?
>
> Agreed here.
>
> The JVM startup overhead does exist but I wouldn't think its too high
> a cost overall, given the simple benefits it can provide instead.
> There is also JVM reuse which makes sense to use for CPU intensive
> applications, so you can take advantage of the HotSpot features of the
> JVM as it gets reused for running tasks of the same job.
>
>> OK, so are you saying:
>> - For CPU intensive tasks, multiple threads might help
>> - For IO intensive tasks, multiple OS processes achieve higher
>> throughput than multiple threads within a smaller number of OS
>> processes?
>
> Yep, but also if you limit your total slots to 1 in favor of going all
> for multi-threading, you won't be able to smoothly run multiple jobs
> at the same time. Tasks from new jobs may have to wait longer to run,
> while in regular slotted environments this is easier to achieve.
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB