Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
bejoy.hadoop@... 2012-02-10, 19:00
I'm the culprit who posted the blog. :) The topic interested me as well, and I found the conversation informative and useful. I thought of documenting it since it could be useful to others in the future. Hope you don't mind!
Bejoy K S
From handheld, Please excuse typos.
From: Rob Stewart <[EMAIL PROTECTED]>
Date: Fri, 10 Feb 2012 18:30:53
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Oddly, this blog post has appeared within the last hour or so....
On 10 February 2012 14:20, Harsh J <[EMAIL PROTECTED]> wrote:
> Hello again,
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <[EMAIL PROTECTED]> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when that's what slots at the TaskTracker (TT) are for?
> The cost of a single map task failure with your mammoth task approach
> would also be higher - more work to repeat.
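
A minimal sketch of the threading model described above, using only the Java standard library rather than Hadoop itself. The class `SingleReaderParallelMap` and its methods are hypothetical names for illustration; this is not the MultithreadedMapper source. One loop reads records sequentially, and a thread pool runs the map function on whole records. A single record such as "foo bar lambda beta" is never subdivided among threads. Unlike a real job, this sketch also collects output in input order to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; not Hadoop's MultithreadedMapper.
class SingleReaderParallelMap {

    // Reads records one at a time on the calling thread (the single "reader"),
    // and runs mapFn on each whole record in a fixed-size thread pool.
    static <I, O> List<O> run(Iterable<I> records, Function<I, List<O>> mapFn, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<O>>> pending = new ArrayList<>();
        try {
            for (I record : records) {
                Callable<List<O>> task = () -> mapFn.apply(record);
                pending.add(pool.submit(task));
            }
            // Gather results in submission order (a simplification of this sketch).
            List<O> output = new ArrayList<>();
            for (Future<List<O>> f : pending) {
                output.addAll(f.get());
            }
            return output;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Word-count style map step: split one line into its words.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("foo bar lambda beta", "foo beta");
        System.out.println(run(lines, SingleReaderParallelMap::tokenize, 2));
    }
}
```

With two threads, each line is tokenized by one thread as a whole; extra threads only help when there are many records and the map step is expensive.
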
>> Are you saying here that 4 single-threaded OS processes can achieve a
>> higher rate of OS IO, than 4 threads within one OS process doing IO
>> (which would sound sensible if that's the case).
> Yeah, that's what I meant, but with the earlier point of "In
> MultithreadedMapper, the IO work is still single threaded"
> specifically in mind.
>> The argument against this approach is that starting up OS
>> processes is far more expensive than forking threads within processes.
>> So I would have said the contrary - where map tasks are small and
>> input size is large, then many JVMs would be instantiated throughout
>> the system, one per task. Instead, one might speculate that reducing
>> the number of JVMs, and replacing them with lower-latency thread
>> forking, would improve runtime speeds?
> Agreed here.
> The JVM startup overhead does exist but I wouldn't think it's too high
> a cost overall, given the simple benefits it can provide instead.
> There is also JVM reuse which makes sense to use for CPU intensive
> applications, so you can take advantage of the HotSpot features of the
> JVM as it gets reused for running tasks of the same job.
>> OK, so are you saying:
>> - For CPU intensive tasks, multiple threads might help
>> - For IO intensive tasks, multiple OS processes achieve higher
>> throughput than multiple threads within a smaller number of OS
>> processes
> Yep, but also if you limit your total slots to 1 in favor of going all
> for multi-threading, you won't be able to smoothly run multiple jobs
> at the same time. Tasks from new jobs may have to wait longer to run,
> while in regular slotted environments this is easier to achieve.
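
The trade-off between slots and threads can be put into rough numbers. The sketch below is a hypothetical model, not a Hadoop API: `mapSlots` stands in for the per-node value of `mapred.tasktracker.map.tasks.maximum`, and `threadsPerTask` for the pool size passed to `MultithreadedMapper.setNumberOfThreads`. It assumes every slot is busy and work is spread evenly across tasks.

```java
// Hypothetical model for illustration; these names are not Hadoop APIs.
class NodeLayout {
    final int mapSlots;        // stands in for mapred.tasktracker.map.tasks.maximum
    final int threadsPerTask;  // stands in for the MultithreadedMapper pool size

    NodeLayout(int mapSlots, int threadsPerTask) {
        if (mapSlots < 1 || threadsPerTask < 1) {
            throw new IllegalArgumentException("slots and threads must be >= 1");
        }
        this.mapSlots = mapSlots;
        this.threadsPerTask = threadsPerTask;
    }

    // Upper bound on map() calls running at the same moment on the node.
    int concurrentMapCalls() {
        return mapSlots * threadsPerTask;
    }

    // Each slot runs one task at a time, so at most this many jobs
    // can have a map task running on the node simultaneously.
    int maxJobsRunningAtOnce() {
        return mapSlots;
    }

    // Share of the node's in-flight map work that must be repeated if one
    // task fails, assuming work is spread evenly across busy slots.
    double shareLostIfOneTaskFails() {
        return 1.0 / mapSlots;
    }

    public static void main(String[] args) {
        NodeLayout[] layouts = { new NodeLayout(4, 1), new NodeLayout(1, 4) };
        for (NodeLayout n : layouts) {
            System.out.println(n.mapSlots + " slot(s) x " + n.threadsPerTask
                + " thread(s): concurrent map() calls=" + n.concurrentMapCalls()
                + ", jobs at once=" + n.maxJobsRunningAtOnce()
                + ", share lost on one failure=" + n.shareLostIfOneTaskFails());
        }
    }
}
```

Both layouts allow four concurrent map() calls, but the single-slot layout lets only one job run on the node at a time and repeats all of its in-flight work after one task failure.
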
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about