Re: Number of Maps running more than expected
Bertrand Dechoux 2012-08-16, 21:01
Well, there is speculative execution too.
> *Speculative execution:* One problem with the Hadoop system is that by
> dividing the tasks across many nodes, it is possible for a few slow nodes
> to rate-limit the rest of the program. For example if one node has a slow
> disk controller, then it may be reading its input at only 10% the speed of
> all the other nodes. So when 99 map tasks are already complete, the system
> is still waiting for the final map task to check in, which takes much
> longer than all the other nodes.
> By forcing tasks to run in isolation from one another, individual tasks do
> not know *where* their inputs come from. Tasks trust the Hadoop platform
> to just deliver the appropriate input. Therefore, the same input can be
> processed *multiple times in parallel*, to exploit differences in machine
> capabilities. As most of the tasks in a job are coming to a close, the
> Hadoop platform will schedule redundant copies of the remaining tasks
> across several nodes which do not have other work to perform. This process
> is known as *speculative execution*. When tasks complete, they announce
> this fact to the JobTracker. Whichever copy of a task finishes first
> becomes the definitive copy. If other copies were executing speculatively,
> Hadoop tells the TaskTrackers to abandon the tasks and discard their
> outputs. The Reducers then receive their inputs from whichever Mapper
> completed successfully, first.
> Speculative execution is enabled by default. You can disable speculative
> execution for the mappers and reducers by setting the
> mapred.map.tasks.speculative.execution and
> mapred.reduce.tasks.speculative.execution JobConf options to false,
> respectively.
Can you tell us your configuration with regard to those parameters?
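
One way to check those two parameters is to read them out of the cluster or job configuration XML. Below is a minimal Python sketch, assuming the usual Hadoop `<configuration>/<property>/<name>/<value>` layout. The function name `speculative_settings` and the sample XML are made up for illustration, and a missing key is reported as enabled only because the excerpt above says that is the default. A value set per job (for example through JobConf) could still override what the file says.

```python
import xml.etree.ElementTree as ET

# Property names quoted in the tutorial excerpt above.
SPECULATIVE_KEYS = (
    "mapred.map.tasks.speculative.execution",
    "mapred.reduce.tasks.speculative.execution",
)


def speculative_settings(config_xml):
    """Return {property name: bool} for the two speculative-execution keys.

    A key that is absent from the XML is reported as True, because the
    excerpt states that speculative execution is enabled by default.
    """
    root = ET.fromstring(config_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name in SPECULATIVE_KEYS:
            found[name] = (value == "true")
    return {key: found.get(key, True) for key in SPECULATIVE_KEYS}


# Hypothetical configuration content, for illustration only.
SAMPLE = """
<configuration>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, enabled in speculative_settings(SAMPLE).items():
        print(key, "=", enabled)
```

With the sample above, the map-side key is reported as disabled and the reduce-side key falls back to the default.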
On Thu, Aug 16, 2012 at 8:36 PM, in.abdul <[EMAIL PROTECTED]> wrote:
> Hi Gaurav,
> The number of maps does not depend on the number of blocks. It really
> depends on the number of input splits. If you had 100 GB of data and 10
> splits, you would see only 10 maps.
> Please correct me if I am wrong.
> Thanks and regards,
> Syed abdul kather
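
The point about input splits can be made concrete with a little arithmetic. Below is a rough Python sketch, assuming that splits are computed per file, that a split never spans two files, and that the split size equals the block size. The helper name `expected_maps` and the multi-file sizes are hypothetical, and the real split logic may treat a short final split differently, so this is an estimate and not a diagnosis of the 900 maps reported in the thread.

```python
MB = 1024 * 1024
GB = 1024 * MB


def expected_maps(file_sizes, split_size):
    """Estimate the map count as the sum of per-file split counts.

    Each file is rounded up to a whole number of splits on its own,
    so many files can yield more maps than total_bytes / split_size.
    Empty files are skipped in this simplified model.
    """
    total = 0
    for size in file_sizes:
        if size > 0:
            # Integer ceiling division: avoids floating-point rounding.
            total += (size + split_size - 1) // split_size
    return total


if __name__ == "__main__":
    # One 100 GB file with 128 MB splits, as in the original question.
    print(expected_maps([100 * GB], 128 * MB))

    # Hypothetical layout: the same 2000 MB as one file vs. ten files.
    print(expected_maps([2000 * MB], 128 * MB))
    print(expected_maps([200 * MB] * 10, 128 * MB))
```

The single 100 GB file gives the 800 maps computed in the original question. In the hypothetical second case, the same number of bytes spread over ten files needs more maps than one large file, because each file contributes its own partially filled last split.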
> On Aug 16, 2012 7:44 PM, "Gaurav Dasgupta [via Lucene]" <
> ml-node+[EMAIL PROTECTED]> wrote:
> > Hi users,
> > I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
> > the 12 nodes and 1 node running the Job Tracker).
> > In order to perform a WordCount benchmark test, I did the following:
> > - Executed "RandomTextWriter" first to create 100 GB of data (note that I
> > changed only the "test.randomtextwrite.total_bytes" parameter; all
> > others are kept at their defaults).
> > - Next, executed the "WordCount" program for that 100 GB dataset.
> > The "Block Size" in "hdfs-site.xml" is set to 128 MB. Now, according to my
> > calculation, the total number of Maps to be executed by the wordcount job
> > should be 100 GB / 128 MB, or 102400 MB / 128 MB = 800.
> > But when I execute the job, it runs a total of 900 Maps,
> > i.e., 100 extra.
> > So, why these extra Maps? My job does complete successfully
> > without any error, though.
> > Again, if I don't execute the "RandomTextWriter" job to create data for
> > my wordcount, but instead put my own 100 GB text file in HDFS and run
> > "WordCount", I can then see that the number of Maps is equal to my
> > calculation, i.e., 800.
> > Can anyone tell me why Hadoop shows this odd behaviour regarding the
> > number of Maps for WordCount only when the dataset is generated by
> > RandomTextWriter? And what is the purpose of these extra Maps?
> > Regards,
> > Gaurav Dasgupta