Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Hi Rob
       I'll try to answer this. From my understanding, if you run the word count example with MultithreadedMapper and TextInputFormat, and your input split contains 2 lines with 2 map threads configured, the RecordReader reads line 1 and hands it to map thread 1, and line 2 to map thread 2. The same map logic then runs on those two lines in parallel. That is the default behavior.
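The dispatch Bejoy describes can be sketched in plain Java (this is not Hadoop code; the class and method names are illustrative): a single shared cursor plays the role of the record reader, and each map thread pulls whole lines from it, so records are never sub-divided between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class MultithreadedDispatchSketch {
    public static ConcurrentMap<String, AtomicInteger> run(String[] lines, int numThreads)
            throws InterruptedException {
        ConcurrentMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        // Single shared "record reader" cursor: reads are serialized,
        // only the map work is parallel.
        AtomicInteger next = new AtomicInteger(0);

        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            Thread thread = new Thread(() -> {
                int i;
                // Each thread claims the next whole line; a line is never
                // split between two threads.
                while ((i = next.getAndIncrement()) < lines.length) {
                    for (String word : lines[i].split("\\s+")) {
                        counts.computeIfAbsent(word, w -> new AtomicInteger())
                              .incrementAndGet();
                    }
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] split = { "foo bar lambda beta", "foo foo bar" };
        ConcurrentMap<String, AtomicInteger> counts = run(split, 2);
        System.out.println(counts.get("foo").get()); // prints 3
    }
}
```

Thread 1 might get line 1 and thread 2 line 2, or one thread might get both if the other is slow to start; either way the counts come out the same.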
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Rob Stewart <[EMAIL PROTECTED]>
Date: Fri, 10 Feb 2012 18:39:44
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J <[EMAIL PROTECTED]> wrote:
> Hello again,
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <[EMAIL PROTECTED]> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when that's what slots at the TT are for?
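For reference, wiring up MultithreadedMapper in a job driver looks roughly like this (a configuration sketch; `TokenizerMapper` stands in for any ordinary single-threaded Mapper implementation, and input/output paths and the reducer are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedWordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mt-wordcount");
        job.setJarByClass(MultithreadedWordCountDriver.class);

        // The job's mapper is the multithreaded wrapper; the real map
        // logic is the ordinary mapper it runs inside its thread pool.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, TokenizerMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // FileInputFormat/FileOutputFormat paths, reducer, etc. omitted
    }
}
```

Note that nothing here changes how the input is split: the wrapped mapper's map() is simply invoked from multiple threads against records from the same split.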

I'm still unsure how MultithreadedMapper knows how to split the
input value into chunks, one chunk for each thread. As far as I can
see, the Hadoop 0.23 trunk offers only one example:

And in that source code there is no custom logic for local splits per
map task at all. Again, going back to the word count example: given a
line of text as input to a map, which comprises 6 words, if I
specify .setNumberOfThreads( 2 ), ideally I'd want 3 words
analysed by one thread and the other 3 by the other. Is that what would
happen? i.e. I'm unsure whether the MultithreadedMapper class does
the splitting of inputs to map tasks...