Hadoop, mail # user - MR job scheduler


Re: MR job scheduler
bharath vissapragada 2009-08-21, 06:29
Amogh

I think the reduce phase starts only when all the map tasks have completed,
because it needs all the values corresponding to a particular key.
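
To illustrate what I mean, here is a minimal Python sketch. It is not Hadoop code; the map outputs and the helper name `group_by_key` are made up for illustration. It shows that grouping the output of only some map tasks can give an incomplete value list for a key, so the reduce function cannot safely run on that key until every map has finished, even if the copying of map output starts earlier as Amogh describes.

```python
from collections import defaultdict

def group_by_key(map_outputs):
    """Merge (key, value) pairs from several map tasks into key -> list of values."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)

# Hypothetical outputs of three map tasks (assumed data, for illustration only).
map_outputs = [
    [("K", 1), ("A", 5)],
    [("A", 2)],
    [("K", 7)],
]

partial = group_by_key(map_outputs[:2])   # only the first two maps have finished
complete = group_by_key(map_outputs)      # all three maps have finished

print(partial["K"])    # values for "K" seen so far
print(complete["K"])   # values for "K" once every map is done
```

In this toy data the third map still holds a value for "K", so reducing "K" after two maps would miss it.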

2009/8/21 Amogh Vasekar <[EMAIL PROTECTED]>

> I'm not sure that is the case with Hadoop. I think it assigns a reduce
> task to an available tasktracker at any instant, since a reducer polls the JT
> for completed maps. If it were as you said, a reducer wouldn't be
> initialized until all maps have completed, after which the copy phase would
> start.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: bharath vissapragada [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 21, 2009 9:50 AM
> To: [EMAIL PROTECTED]
> Subject: Re: MR job scheduler
>
> OK, I'll be a bit more specific.
>
> Suppose the map outputs 100 different keys.
>
> Consider a key "K" whose corresponding values may be on N different datanodes.
> Consider a datanode "D" which has the maximum number of values. So instead of
> moving the values on "D"
> to other systems, it is useful to bring in the values from the other datanodes
> to "D" to minimize the data movement and
> also the delay. The same applies to all the other keys. How does the
> scheduler take care of this?
> 2009/8/21 zjffdu <[EMAIL PROTECTED]>
>
> > Adding some details:
> >
> > 1. The number of maps is determined by the block size and the InputFormat
> > (whether you want to split the input or not).
> >
> > 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and
> > Capacity Scheduler are the two other options that I know of. The JobTracker
> > hosts the scheduler.
> >
> > 3. Once a map task is done, it tells its own tasktracker, and the
> > tasktracker tells the jobtracker. So the jobtracker manages all the tasks,
> > and it decides how and when to start the reduce tasks.
> >
> >
> >
> > -----Original Message-----
> > From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
> > Sent: August 20, 2009 11:41
> > To: [EMAIL PROTECTED]
> > Subject: Re: MR job scheduler
> >
> >
> > On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
> >
> > > Hi all,
> > >
> > > Can anyone tell me how the MR scheduler schedules MR jobs?
> > > How does it decide where to create map tasks and how many to create?
> > > Once the map tasks are over, how does it decide to move the keys to the
> > > reducers efficiently (minimizing the data movement across the network)?
> > > Is there any doc available that describes this scheduling process
> > > clearly?
> > >
> >
> > The #maps is decided by the application. The scheduler decides where
> > to execute them.
> >
> > Once a map is done, the reduce tasks connect to the tasktracker (on
> > the node where the map task executed) and copy the entire output
> > over HTTP.
> >
> > Arun
> >
> >
>
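
As a rough illustration of the point quoted above that the number of maps follows from the block size and whether the InputFormat splits its input, here is a small Python sketch. The function name `estimate_num_maps` and the file sizes are hypothetical, and the sketch assumes non-empty files and exactly one map per block-sized split; a real InputFormat may apply further rules that are ignored here.

```python
def estimate_num_maps(file_sizes, block_size, splittable=True):
    """Rough map-task count: one per block-sized split, or one per file if not splittable.

    Assumes every file is non-empty. This is an illustrative approximation,
    not Hadoop's actual split computation.
    """
    if not splittable:
        return len(file_sizes)
    # Integer ceiling division: number of block-sized pieces in each file.
    return sum(-(-size // block_size) for size in file_sizes)

MB = 1024 * 1024
sizes = [200 * MB, 64 * MB, 10 * MB]   # hypothetical input files

print(estimate_num_maps(sizes, 64 * MB))                     # splittable input
print(estimate_num_maps(sizes, 64 * MB, splittable=False))   # one map per file
```

With a 64 MB block size, the 200 MB file contributes four splits and the two smaller files one each; if the input is not split, each file gets a single map.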