Re: MR job scheduler
Yes, my doubt is: how is the location of the reducer selected? Is it
chosen arbitrarily, or is it placed on a particular machine that already
holds more of the values (corresponding to that reducer's key), which would
reduce the cost of transferring data across the network (because many of the
values for that key are already on the machine where the map phase completed)?
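
For context on the question above: in Hadoop, which reducer a key goes to is
decided by the partitioner (by default, a hash of the key modulo the number of
reduce tasks), independently of where the values physically live; the reduce
task itself is then scheduled on whichever tasktracker has a free reduce slot.
A minimal sketch of the default HashPartitioner logic (simplified, not the
actual Hadoop source):

```java
// Simplified sketch of Hadoop's default HashPartitioner behavior:
// the reducer index is a pure function of the key's hash,
// not of where the key's values are stored.
public class PartitionSketch {
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the index is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of key "K", on any datanode, maps to the
        // same reducer index.
        System.out.println(partitionFor("K", 10)); // prints 5
    }
}
```

So data locality does not enter into the key-to-reducer assignment at all;
only the map side is scheduled for locality.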

2009/8/21 Amogh Vasekar <[EMAIL PROTECTED]>

> Yes, but the copy phase starts with the initialization of a reducer, after
> which it keeps polling for completed map tasks to fetch their respective
> outputs.
>
> -----Original Message-----
> From: bharath vissapragada [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 21, 2009 12:00 PM
> To: [EMAIL PROTECTED]
> Subject: Re: MR job scheduler
>
> Amogh
>
> I think the reduce phase starts only after all the map tasks have completed,
> because it needs all the values corresponding to a particular key!
>
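
The two views above are reconciled in Hadoop by a configuration knob: reduce
tasks can be launched (and begin copying map output) after only a fraction of
the maps have finished, while the reduce function itself runs only once all
map output has arrived. A sketch, assuming the 0.19/0.20-era property name:

```xml
<!-- mapred-site.xml: launch reducers once 5% of the maps have
     completed (the default); raise toward 1.0 to delay reducer
     start until more maps are done. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```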
> 2009/8/21 Amogh Vasekar <[EMAIL PROTECTED]>
>
> > I'm not sure that is the case with Hadoop. I think it assigns a reduce
> > task to an available tasktracker at any instant, since a reducer polls the
> > JT for completed maps. If it were as you said, a reducer would not be
> > initialized until all maps had completed, after which the copy phase would
> > start.
> >
> > Thanks,
> > Amogh
> >
> > -----Original Message-----
> > From: bharath vissapragada [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, August 21, 2009 9:50 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: MR job scheduler
> >
> > OK, I'll be a bit more specific.
> >
> > Suppose a map outputs 100 different keys.
> >
> > Consider a key "K" whose corresponding values may be spread over N
> > different datanodes, and a datanode "D" that holds the maximum number of
> > those values. Instead of moving the values on "D" to other machines, it
> > would be better to bring the values from the other datanodes to "D", to
> > minimize data movement and delay. The same applies to all the other keys.
> > How does the scheduler take care of this?
> > 2009/8/21 zjffdu <[EMAIL PROTECTED]>
> >
> > > Some details to add:
> > >
> > > 1. The number of maps is determined by the block size and the
> > > InputFormat (whether you want the input to be split or not).
> > >
> > > 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> > > Capacity Scheduler are the other two options as far as I know. The
> > > JobTracker hosts the scheduler.
> > >
> > > 3. Once a map task is done, it tells its own tasktracker, and the
> > > tasktracker tells the JobTracker. So the JobTracker manages all the
> > > tasks, and it decides how and when to start the reduce tasks.
> > >
> > >
> > >
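
On point 2 above, the scheduler is pluggable on the JobTracker via
configuration; for example, selecting the Fair Scheduler (property and class
names as in the 0.20-era contrib module):

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler with the
     Fair Scheduler (the fairscheduler contrib jar must be on the
     JobTracker classpath). -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```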
> > > -----Original Message-----
> > > From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
> > > Sent: August 20, 2009 11:41
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: MR job scheduler
> > >
> > >
> > > On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
> > >
> > > > Hi all,
> > > >
> > > > Can anyone tell me how the MR scheduler schedules MR jobs?
> > > > How does it decide where to create map tasks, and how many to create?
> > > > Once the map tasks are over, how does it decide how to move the keys
> > > > to the reducers efficiently (minimizing data movement across the
> > > > network)? Is there any doc available that describes this scheduling
> > > > process in detail?
> > > >
> > >
> > > The number of maps is decided by the application; the scheduler decides
> > > where to execute them.
> > >
> > > Once a map is done, the reduce tasks connect to the tasktracker (on the
> > > node where the map task executed) and copy their portion of the map
> > > output over HTTP.
> > >
> > > Arun
> > >
> > >
> >
>
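
As a rough illustration of the point about the number of maps: with a
FileInputFormat-style input, one map task is launched per input split, and the
split count follows from the file size and the split size. A simplified
sketch (illustrative only, not the actual Hadoop code; the names here are
made up for the example):

```java
// Simplified sketch of FileInputFormat-style split counting.
// One map task is launched per input split.
public class SplitSketch {
    static long numSplits(long fileSize, long blockSize, long minSplitSize) {
        // The effective split size is at least the minimum and,
        // in the common case, the HDFS block size.
        long splitSize = Math.max(minSplitSize, blockSize);
        // Ceiling division: a trailing partial block still needs a map.
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        // A 1 GiB file with 64 MiB blocks -> 16 map tasks.
        long gib = 1024L * 1024 * 1024;
        long mib64 = 64L * 1024 * 1024;
        System.out.println(numSplits(gib, mib64, 1)); // prints 16
    }
}
```

This is why, as noted above, the application (via its InputFormat and input
size) effectively fixes the number of maps, while the scheduler only chooses
where they run.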