Hadoop >> mail # user >> Combining MultithreadedMapper threadpool size & map.tasks.maximum


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Hi Rob
       I'll try to answer this. As I understand it, suppose you run the word count example with MultithreadedMapper and TextInputFormat, and you have 2 threads and 2 lines in your input split. The RecordReader would read line 1 and give it to map thread 1, and line 2 to map thread 2. So the same map logic would run on these two lines in parallel, with each thread handling a whole line. This would be the default behavior.
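
To make the record-level hand-off concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is only an illustration of the behavior described above, not the actual MultithreadedMapper source: the names MultithreadedMapSketch, SharedReader and wordCount are made up for this example. One shared reader hands out whole lines one at a time, and each worker thread tokenizes the line it received, so a single line is never divided between threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MultithreadedMapSketch {

    // Hypothetical stand-in for the map task's single RecordReader:
    // reads are serialized, so each record goes to exactly one thread.
    static final class SharedReader {
        private final Iterator<String> lines;

        SharedReader(List<String> split) {
            this.lines = split.iterator();
        }

        synchronized String nextRecord() {
            return lines.hasNext() ? lines.next() : null;
        }
    }

    // Runs a word-count style map() over the split using numThreads workers.
    static Map<String, Integer> wordCount(List<String> split, int numThreads) {
        SharedReader reader = new SharedReader(split);
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();

        for (int i = 0; i < numThreads; i++) {
            Thread worker = new Thread(() -> {
                String line;
                // Each thread pulls the next whole line, then maps it alone.
                while ((line = reader.nextRecord()) != null) {
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            counts.merge(word, 1, Integer::sum);
                        }
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("foo bar lambda beta", "foo beta");
        System.out.println(wordCount(split, 2));
    }
}
```

In this sketch, a split containing only one line keeps just one thread busy regardless of the thread count, because the line is the unit of work.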
Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Rob Stewart <[EMAIL PROTECTED]>
Date: Fri, 10 Feb 2012 18:39:44
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J <[EMAIL PROTECTED]> wrote:
> Hello again,
>
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <[EMAIL PROTECTED]> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
>
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when that's what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example of its use in the Hadoop 0.23 trunk:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example: given a
line of text as input to a map, comprising 6 words, I
specify .setNumberOfThreads( 2 ), so ideally I'd want 3 words
analysed by one thread and the other 3 by the other. Is that what would
happen? I.e., I'm unsure whether the MultithreadedMapper class does
the splitting of inputs to map tasks...

Regards,