-Re: Doubt from the book "Definitive Guide"
Mohit Anchlia 2012-04-05, 00:43
On Wed, Apr 4, 2012 at 5:23 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:
> Answers inline.
> On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia <[EMAIL PROTECTED]
> > I am going through the chapter "How mapreduce works" and have some
> > confusion:
> > 1) Below description of Mapper says that reducers get the output file
> > HTTP call. But the description under "The Reduce Side" doesn't
> > say if it's copied using HTTP. So first confusion, Is the output copied
> > from mapper -> reducer or from reducer -> mapper? And second, Is the call
> > http:// or hdfs://
> Map output is written to local FS, not HDFS.
> Thanks! Is there any reason why map output is stored locally and not
stored in HDFS? Wouldn't it make reducers fasters since reducers might be
able to use data locality like mapper tasks do?
> > 2) My understanding was that mapper output gets written to hdfs, since
> > seen part-m-00000 files in hdfs. If mapper output is written to HDFS then
> > shouldn't reducers simply read it from hdfs instead of making http calls
> > tasktrackers location?
> > Map output is sent to HDFS when reducer is not used.
> > ----- from the book ---
> > Mapper
> > The output file’s partitions are made available to the reducers over
> > The number of worker threads used to serve the file partitions is
> > controlled by the tasktracker.http.threads property
> > this setting is per tasktracker, not per map task slot. The default of 40
> > may need increasing for large clusters running large jobs.6.4.2.
> > The Reduce Side
> > Let’s turn now to the reduce part of the process. The map output file is
> > sitting on the local disk of the tasktracker that ran the map task
> > (note that although map outputs always get written to the local disk of
> > map tasktracker, reduce outputs may not be), but now it is needed by the
> > tasktracker
> > that is about to run the reduce task for the partition. Furthermore, the
> > reduce task needs the map output for its particular partition from
> > map tasks across the cluster.
> > The map tasks may finish at different times, so the reduce task starts
> > copying their outputs as soon as each completes. This is known as the
> > phase of the reduce task.
> > The reduce task has a small number of copier threads so that it can fetch
> > map outputs in parallel.
> > The default is five threads, but this number can be changed by setting
> > mapred.reduce.parallel.copies property.