On Thu, Jul 1, 2010 at 3:37 AM, Pedro Costa <[EMAIL PROTECTED]> wrote:
> - I'm running the wordcount example that accepts 3 small txt files as input.
> I assume that there will exist 3 mappers that produce 3 map outputs. One map
> output per txt file, right?
The number of mappers depends on the number of 'InputSplits' generated
for the job's input. How InputSplits are created is customizable. By
default, for file-based inputs, the splits are computed from the HDFS
blocks the files are stored in. So there is one mapper per split, not
per file; with 3 small files that each fit in a single block, that does
work out to 3 mappers by default. Each mapper generates output for all
reduces. The number of reduces is configured by the user. So, if you've
configured 2 reduces, each mapper would typically produce 2 map outputs
(partitions), one per reduce.
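To make the "per split, not per file" point concrete, here is a rough sketch of the arithmetic. It is not Hadoop's actual FileInputFormat code: the class and method names are made up for illustration, it assumes the split size equals the block size (64 MB here is just an assumed value), and it ignores the min/max split size settings and any special handling of a file's last chunk.

```java
// Simplified illustration only; names are hypothetical, not Hadoop APIs.
class SplitCountSketch {
    // Splits for one non-empty file, assuming split size == block size.
    static long splitsForFile(long fileLength, long blockSize) {
        if (fileLength <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        return (fileLength + blockSize - 1) / blockSize; // ceiling division
    }

    // One mapper per split, summed over all input files.
    static long totalMappers(long[] fileLengths, long blockSize) {
        long total = 0;
        for (long length : fileLengths) {
            total += splitsForFile(length, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size

        // Three small text files: one split each, so 3 mappers.
        long[] smallFiles = {1_000, 2_000, 3_000};
        System.out.println(totalMappers(smallFiles, blockSize)); // prints 3

        // One small file plus one 200 MB file: 1 + 4 = 5 mappers.
        long[] mixedFiles = {1_000, 200L * 1024 * 1024};
        System.out.println(totalMappers(mixedFiles, blockSize)); // prints 5
    }
}
```

So 3 mappers for 3 files is a consequence of the files being small, not a rule.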
> When reduce tasks fetch the map outputs, they have to know which map
> output they can get. How does each reduce know which map output it can
> get? Who gives it this information?
Each reduce task is assigned a 'partition' number by the framework
when it is created. The partition number identifies the portion of
each map's output that the reduce task must fetch.
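Put differently, a reduce with partition number p pulls segment p from every map's output. A toy sketch of that bookkeeping follows; the class, the method and the "map-N/partition-P" labels are all hypothetical and only illustrate the idea, they are not how the framework names or locates anything.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of which map output segments a reduce fetches; names are hypothetical.
class FetchPlanSketch {
    // The reduce with the given partition number needs that partition from every map.
    static List<String> segmentsFor(int partition, int numMaps, int numReduces) {
        if (partition < 0 || partition >= numReduces) {
            throw new IllegalArgumentException("partition out of range: " + partition);
        }
        List<String> segments = new ArrayList<>();
        for (int map = 0; map < numMaps; map++) {
            segments.add("map-" + map + "/partition-" + partition);
        }
        return segments;
    }

    public static void main(String[] args) {
        int numMaps = 3;    // e.g. one map per small input file
        int numReduces = 2; // configured by the user
        for (int reduce = 0; reduce < numReduces; reduce++) {
            System.out.println("reduce-" + reduce + " fetches "
                    + segmentsFor(reduce, numMaps, numReduces));
        }
    }
}
```

With 3 maps and 2 reduces, each reduce fetches 3 segments, one from every map.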
> - Is the distribution of map outputs to the reducers called partitioning?
The allocation of map outputs to reduces happens via partitioning. The
actual data transfer is referred to as the Shuffle phase.
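As far as I recall, the default HashPartitioner boils down to the key's hash code, made non-negative, modulo the number of reduces. The sketch below mimics that in plain Java without the Hadoop classes; the class and method names are mine, and a job can plug in a different Partitioner entirely.

```java
// Stand-alone imitation of hash partitioning; not the Hadoop class itself.
class HashPartitionSketch {
    // Maps a key to a partition number in the range [0, numReduces).
    static int partitionFor(Object key, int numReduces) {
        if (numReduces <= 0) {
            throw new IllegalArgumentException("numReduces must be positive");
        }
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        int numReduces = 2;
        String[] words = {"hadoop", "map", "reduce", "shuffle"};
        for (String word : words) {
            // The same word always lands in the same partition, whichever map emits it.
            System.out.println(word + " -> reduce-" + partitionFor(word, numReduces));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word ends up at the same reduce, which is what makes the word count totals come out right.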
> - The reducers, during the sort phase, already know which map outputs
> they should get, right?
They already know it during the Shuffle phase, which comes before the sort.
> On Wed, Jun 30, 2010 at 9:36 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>> On Jun 30, 2010, at 1:29 PM, Pedro Costa wrote:
>>> As I understand from what I've read, the purpose of the partition is to
>>> tell each reducer which map output it will have.
>>> For example, if I have 3 split files and 2 reduces defined in my example,
>>> on the map side 3 map outputs will be produced (one map per split file) and
>>> on the reduce side 2 part-* files will be produced. part-00000
>>> will contain the results of 2 map outputs and part-00001 will contain
>>> the results of 1 map output.
>> Typically, each map produces output for each reduce.
>> In your example, part-00000 will contain the output of reduce-0 and
>> part-00001 will contain the output of reduce-1.
>>> - My question is, where in Hadoop MR is it set "which Reduce contains
>>> which Map Output"? Is it during the creation of the reduce tasks, or in
>>> another phase of the MR?
>>> - Can you point me to the class that does this distribution of map
>>> outputs to the reduce tasks?
>> Take a look at the Partitioner - the partitioner for the job decides which
>> keys are sent to which reduce.