Hadoop >> mail # user >> Combining MultithreadedMapper threadpool size & map.tasks.maximum

Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Here is what I understand:

The RecordReader for the MultithreadedMapper takes the input split and cycles the records among the available threads. It also ensures that the map outputs are synchronized.

So what Bejoy describes is exactly what will happen for the word count program.
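That cycling behavior can be sketched without Hadoop at all. The plain-Java sketch below (all class and variable names are mine, not MultithreadedMapper's actual internals) keeps record reads single-threaded behind a lock while the per-record map work runs on a thread pool and emits into a synchronized output list — the same shape as the MTMapper dispatch described above, shown on the word count example:

```java
import java.util.*;
import java.util.concurrent.*;

public class MultithreadedMapperSketch {
    // Records are handed out one at a time under a lock (single-threaded IO),
    // but each record's "map()" body runs on whichever pool thread grabbed it.
    public static List<String> run(List<String> lines, int numThreads) {
        Iterator<String> reader = lines.iterator();   // stands in for the shared RecordReader
        Object readLock = new Object();               // serializes reads across threads
        List<String> output = Collections.synchronizedList(new ArrayList<>()); // synchronized collector

        Runnable mapRunner = () -> {
            while (true) {
                String line;
                synchronized (readLock) {             // IO stays single-threaded
                    if (!reader.hasNext()) return;
                    line = reader.next();
                }
                for (String word : line.split("\\s+")) {
                    output.add(word + "\t1");         // the map() body: word count emit
                }
            }
        };

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) pool.submit(mapRunner);
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return output;
    }

    public static void main(String[] args) {
        // Two records, two threads: each thread maps a whole record; no record is sub-split.
        List<String> out = run(Arrays.asList("foo bar", "lambda beta"), 2);
        Collections.sort(out);
        System.out.println(out);
    }
}
```

Note that each thread always consumes a whole record — the record itself is never split further among threads, which is the crux of the question below.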


>Sent: Friday, February 10, 2012 11:15 AM
>Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
>Hi Rob
>       I'll try to answer this. From my understanding, if you are using MultithreadedMapper on the word count example with TextInputFormat, and you have 2 threads and 2 lines in your input split, the RecordReader would read line 1 and hand it to map thread 1, and line 2 to map thread 2. So an essentially identical map process would run on these two lines in parallel. This would be the default behavior.
>Bejoy K S
>From handheld, Please excuse typos.
>-----Original Message-----
>From: Rob Stewart <[EMAIL PROTECTED]>
>Date: Fri, 10 Feb 2012 18:39:44
>Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
>Thanks, this is a lot clearer. One final question...
>On 10 February 2012 14:20, Harsh J <[EMAIL PROTECTED]> wrote:
>> Hello again,
>> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <[EMAIL PROTECTED]> wrote:
>>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>>> beta">. The canonical Hadoop program would tokenize this line of text
>>> and output <"foo",1> and so on. How would the multithreadedmapper know
>>> how to further divide this line of text into, say: [<null,"foo
>>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>>> somehow provide an additional record reader to split the input to the
>>> map task into sub-inputs for each thread?
>> In MultithreadedMapper, the IO work is still single-threaded, while
>> the map() calls after the read are multithreaded. But yes, you could use a
>> mix of CombineFileInputFormat and some custom logic to have multiple
>> local splits per map task, and divide readers of them among your
>> threads. But why do all this when that's what slots at the TaskTracker are for?
>I'm still unsure how the MultithreadedMapper knows how to split the
>input value into chunks, one chunk for each thread. There is only one
>example in the Hadoop 0.23 trunk:
>And in that source code, there is no custom logic for local splits per
>map task at all. Again, going back to the word count example: given a
>line of text as input to a map, which comprises 6 words, if I
>specify .setNumberOfThreads( 2 ), then ideally I'd want 3 words
>analysed by one thread and the other 3 by the other. Is that what would
>happen? i.e., I'm unsure whether the MultithreadedMapper class does
>the splitting of inputs to map tasks...
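For concreteness, wiring MultithreadedMapper into a word count job looks roughly like the fragment below. The MultithreadedMapper static methods are the real API (org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper); WordCountMapper and the job name are placeholders, and this is a configuration sketch assuming Hadoop's new (0.21+) mapreduce API on the classpath, not a standalone program:

```java
// Delegate the job's mapper to MultithreadedMapper, then tell it which
// single-threaded mapper class to run and how many threads each map task gets.
Job job = new Job(conf, "multithreaded wordcount");
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, WordCountMapper.class); // WordCountMapper is a placeholder
MultithreadedMapper.setNumberOfThreads(job, 2);                 // per-map-task thread pool size
```

The threads share one RecordReader, so setNumberOfThreads(2) means two whole records in flight at once, not one record cut in half.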