-Re: Number of Maps running more than expected
Bertrand Dechoux 2012-08-16, 21:35
Also could you tell us more about your task statuses?
You might also have failed tasks...
On Thu, Aug 16, 2012 at 11:01 PM, Bertrand Dechoux <[EMAIL PROTECTED]>wrote:
> Well, there is speculative executions too.
> *Speculative execution:* One problem with the Hadoop system is that by
>> dividing the tasks across many nodes, it is possible for a few slow nodes
>> to rate-limit the rest of the program. For example if one node has a slow
>> disk controller, then it may be reading its input at only 10% the speed of
>> all the other nodes. So when 99 map tasks are already complete, the system
>> is still waiting for the final map task to check in, which takes much
>> longer than all the other nodes.
>> By forcing tasks to run in isolation from one another, individual tasks
>> do not know *where* their inputs come from. Tasks trust the Hadoop
>> platform to just deliver the appropriate input. Therefore, the same input
>> can be processed *multiple times in parallel*, to exploit differences in
>> machine capabilities. As most of the tasks in a job are coming to a close,
>> the Hadoop platform will schedule redundant copies of the remaining tasks
>> across several nodes which do not have other work to perform. This process
>> is known as *speculative execution*. When tasks complete, they announce
>> this fact to the JobTracker. Whichever copy of a task finishes first
>> becomes the definitive copy. If other copies were executing speculatively,
>> Hadoop tells the TaskTrackers to abandon the tasks and discard their
>> outputs. The Reducers then receive their inputs from whichever Mapper
>> completed successfully, first.
>> Speculative execution is enabled by default. You can disable speculative
>> execution for the mappers and reducers by setting the
>> mapred.map.tasks.speculative.execution and
>> mapred.reduce.tasks.speculative.execution JobConf options to false,
> Can you tell us your configuration with regards to those parameters?
> On Thu, Aug 16, 2012 at 8:36 PM, in.abdul <[EMAIL PROTECTED]> wrote:
>> Hi Gaurav,
>> Number map is not depents upon number block . It is really depends upon
>> number of input splits . If you had 100GB of data and you had 10 split
>> means then you can see only 10 maps .
>> Please correct me if i am wrong
>> Thanks and regards,
>> Syed abdul kather
>> On Aug 16, 2012 7:44 PM, "Gaurav Dasgupta [via Lucene]" <
>> ml-node+[EMAIL PROTECTED]> wrote:
>> > Hi users,
>> > I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
>> > the 12 nodes and 1 node running the Job Tracker).
>> > In order to perform a WordCount benchmark test, I did the following:
>> > - Executed "RandomTextWriter" first to create 100 GB data (Note that
>> > have changed the "test.randomtextwrite.total_bytes" parameter only,
>> > all are kept default).
>> > - Next, executed the "WordCount" program for that 100 GB dataset.
>> > The "Block Size" in "hdfs-site.xml" is set as 128 MB. Now, according to
>> > calculation, total number of Maps to be executed by the wordcount job
>> > should be 100 GB / 128 MB or 102400 MB / 128 MB = 800.
>> > But when I am executing the job, it is running a total number of 900
>> > i.e., 100 extra.
>> > So, why this extra number of Maps? Although, my job is completing
>> > successfully without any error.
>> > Again, if I don't execute the "RandomTextWwriter" job to create data for
>> > my wordcount, rather I put my own 100 GB text file in HDFS and run
>> > "WordCount", I can then see the number of Maps are equivalent to my
>> > calculation, i.e., 800.
>> > Can anyone tell me why this odd behaviour of Hadoop regarding the number
>> > of Maps for WordCount only when the dataset is generated by
>> > RandomTextWriter? And what is the purpose of these extra number of Maps?
>> > Regards,