Re: MR job scheduler
OK, I'll be a bit more specific.

Suppose the map outputs 100 different keys.

Consider a key "K" whose corresponding values may be spread across N
different datanodes, and a datanode "D" which holds the maximum number of
those values. Instead of moving the values on "D" to other nodes, it would
be cheaper to bring the values from the other datanodes to "D", minimizing
both data movement and delay. The same applies to all the other keys. How
does the scheduler take care of this?
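For context on the question above: Hadoop does not place reduce work by
where a key's values happen to sit. Each key is assigned to a reducer by the
job's Partitioner (the default HashPartitioner hashes the key modulo the
number of reduce tasks), and each reducer then pulls its partition from
every map output. A minimal sketch of that default assignment logic (plain
Java mirroring HashPartitioner's arithmetic; the class name here is
illustrative, not Hadoop's API):

```java
// Sketch of Hadoop's default HashPartitioner logic: a key's reducer is
// chosen purely by hash, not by which datanode holds most of its values.
public class PartitionSketch {
    // Mirrors HashPartitioner#getPartition: mask off the sign bit so the
    // result is non-negative, then take it modulo the reduce-task count.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of key "K", on whichever datanode it was
        // produced, maps to the same reducer -- that is what guarantees
        // all values for a key meet in one place.
        System.out.println("K -> reducer " + getPartition("K", 4));
    }
}
```

Because the assignment is a pure function of the key, it is deterministic
across nodes, but it is blind to data placement, which is why the shuffle
copies data over the network rather than moving computation to "D".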
2009/8/21 zjffdu <[EMAIL PROTECTED]>

> To add some details:
>
> 1. The number of maps is determined by the block size and the InputFormat
> (whether or not you want the input to be split).
>
> 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> Capacity Scheduler are the other two options, as far as I know. The
> JobTracker hosts the scheduler.
>
> 3. Once a map task is done, it tells its own tasktracker, and the
> tasktracker tells the jobtracker. The jobtracker manages all the tasks
> and decides how and when to start the reduce tasks.
>
>
>
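Point 1 above can be made concrete: for a plain FileInputFormat, the
framework cuts each input file into splits of roughly the HDFS block size
and launches one map task per split. A simplified sketch of that arithmetic
(plain Java; the real FileInputFormat also factors in a configurable minimum
split size and a goal size derived from the requested number of maps):

```java
// Simplified sketch of how the number of map tasks falls out of the
// input size and the split (block) size, as described in point 1 above.
public class SplitSketch {
    // One split per splitSize-worth of bytes, rounding up for the tail.
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 64 MB blocks (the 2009-era HDFS default)
        // yields 5 splits, hence 5 map tasks.
        System.out.println(numSplits(300 * mb, 64 * mb));
    }
}
```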
> -----Original Message-----
> From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
> Sent: August 20, 2009 11:41
> To: [EMAIL PROTECTED]
> Subject: Re: MR job scheduler
>
>
> On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
>
> > Hi all,
> >
> > Can anyone tell me how the MR scheduler schedules MR jobs?
> > How does it decide where to create map tasks and how many to create?
> > Once the map tasks are over, how does it decide to move the keys to the
> > reducers efficiently (minimizing data movement across the network)?
> > Is there any doc available that describes this scheduling process in
> > detail?
> >
>
> The #maps is decided by the application. The scheduler decides where
> to execute them.
>
> Once the map is done, the reduce tasks connect to the tasktracker (on
> the node where the map task executed) and copy the map's output over
> HTTP.
>
> Arun
>
>
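On point 2 of zjffdu's reply: the scheduler is pluggable on the JobTracker.
In the Hadoop 0.20 line, swapping the default FIFO scheduler for the Fair
Scheduler was done in mapred-site.xml roughly as below (property and class
names are from that era's contrib scheduler, so check your version's docs;
the scheduler jar must also be on the JobTracker's classpath):

```xml
<!-- mapred-site.xml: replace the JobTracker's default FIFO scheduler
     with the Fair Scheduler (Hadoop 0.20-era names; version-dependent). -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```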