Hadoop >> mail # user >> MR job scheduler


Re: MR job scheduler
Amogh

I think the reduce phase starts only when all the map tasks are completed,
because it needs all the values corresponding to a particular key!
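To see why a reducer cannot produce its final output until every map has finished, here is a toy, in-memory sketch (plain Python for illustration, not Hadoop code): the value list for a key is only complete once the outputs of all map tasks have been merged.

```python
from collections import defaultdict

def map_phase(records):
    """Each 'map task' emits (key, value) pairs; here: word -> 1."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(map_outputs):
    """Group every value emitted for a key, across ALL map tasks."""
    groups = defaultdict(list)
    for output in map_outputs:          # one entry per map task
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce sees the complete value list for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two "map tasks" over two input splits
split_a = ["the cat", "the dog"]
split_b = ["the bird"]
outputs = [list(map_phase(split_a)), list(map_phase(split_b))]
counts = reduce_phase(shuffle(outputs))
print(counts)   # {'the': 3, 'cat': 1, 'dog': 1, 'bird': 1}
```

If only split_a had been processed, the count for 'the' would be 2, which is wrong; that is why the final reduce call must wait for every map's output.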

2009/8/21 Amogh Vasekar <[EMAIL PROTECTED]>

> I'm not sure that is the case with Hadoop. I think it assigns a reduce
> task to an available tasktracker at any instant, since a reducer polls the
> JobTracker for completed maps. If it were as you said, a reducer wouldn't
> be initialized until all maps had completed, after which the copy phase
> would start.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: bharath vissapragada [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 21, 2009 9:50 AM
> To: [EMAIL PROTECTED]
> Subject: Re: MR job scheduler
>
> OK, I'll be a bit more specific.
>
> Suppose the map phase outputs 100 different keys.
>
> Consider a key "K" whose corresponding values may be on N different
> datanodes, and a datanode "D" which holds the maximum number of those
> values. Instead of moving the values on "D" to other systems, it would be
> useful to bring the values from the other datanodes to "D", to minimize
> the data movement and also the delay. The same applies to all the other
> keys. How does the scheduler take care of this?
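The heuristic being proposed here (this is the questioner's idea, not what Hadoop actually does; as the replies note, Hadoop simply has each reducer pull its partition from every map's node) could be sketched roughly like this: for each key, run the reduce on the node that already holds the most values and ship only the rest to it.

```python
def pick_reduce_node(values_per_node):
    """values_per_node: {node: number_of_values_for_this_key}.
    Returns the node to reduce on under this heuristic, and how
    many values must then travel over the network."""
    best = max(values_per_node, key=values_per_node.get)
    moved = sum(n for node, n in values_per_node.items() if node != best)
    return best, moved

# Key "K": datanode D already holds most of the values
locations = {"D": 70, "A": 20, "B": 10}
node, moved = pick_reduce_node(locations)
print(node, moved)   # D 30  (versus 80 values moved if reduced on A)
```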
> 2009/8/21 zjffdu <[EMAIL PROTECTED]>
>
> > Adding some details:
> >
> > 1. The number of maps is determined by the block size and the InputFormat
> > (whether or not you want the input to be split).
> >
> > 2. The default scheduler for Hadoop is FIFO; as far as I know, the Fair
> > Scheduler and the Capacity Scheduler are the other two options. The
> > JobTracker runs the scheduler.
> >
> > 3. Once a map task is done, it tells its own tasktracker, and the
> > tasktracker tells the JobTracker. The JobTracker manages all the tasks,
> > and it decides how and when to start the reduce tasks.
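The default FIFO behaviour mentioned in point 2 can be sketched as a queue of jobs served in submission order, with each free tasktracker slot taking the next task of the oldest job. This is a heavy simplification of the real JobTracker logic (no locality, priorities, or speculative execution), just to show the ordering:

```python
from collections import deque

class FifoScheduler:
    """Toy FIFO job scheduler: the oldest submitted job gets free slots first."""
    def __init__(self):
        self.jobs = deque()              # jobs in submission order

    def submit(self, job_id, tasks):
        self.jobs.append((job_id, deque(tasks)))

    def assign(self, tracker):
        """Called when a tasktracker reports a free slot."""
        while self.jobs:
            job_id, tasks = self.jobs[0]
            if tasks:
                return (tracker, job_id, tasks.popleft())
            self.jobs.popleft()          # job fully assigned; move on
        return None

sched = FifoScheduler()
sched.submit("job1", ["m1", "m2"])
sched.submit("job2", ["m1"])
assignments = [sched.assign(t) for t in ["tt1", "tt2", "tt3"]]
print(assignments)
# [('tt1', 'job1', 'm1'), ('tt2', 'job1', 'm2'), ('tt3', 'job2', 'm1')]
```

Note how job2 gets no slot until job1 has no tasks left to hand out, which is exactly the FIFO starvation problem the Fair and Capacity Schedulers were built to address.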
> >
> >
> >
> > -----Original Message-----
> > From: Arun C Murthy [mailto:[EMAIL PROTECTED]]
> > Sent: August 20, 2009 11:41
> > To: [EMAIL PROTECTED]
> > Subject: Re: MR job scheduler
> >
> >
> > On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
> >
> > > Hi all,
> > >
> > > Can anyone tell me how the MR scheduler schedules MR jobs?
> > > How does it decide where to create map tasks and how many to create?
> > > Once the map tasks are over, how does it decide to move the keys to
> > > the reducers efficiently (minimizing the data movement across the
> > > network)? Is there any doc available which describes this scheduling
> > > process in some detail?
> > >
> >
> > The number of maps is decided by the application. The scheduler decides
> > where to execute them.
> >
> > Once a map is done, the reduce tasks connect to the tasktracker (on the
> > node where the map task executed) and copy its entire output over HTTP.
> >
> > Arun
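The pull model described above, where reducers fetch completed map output from each map's tasktracker rather than maps pushing it, can be simulated roughly (plain Python, not the actual shuffle code): each reducer repeatedly polls for newly completed maps and pulls only its own partition of their output.

```python
def reducer_fetch(partition, completed_maps, already_fetched):
    """One poll of the 'copy phase': pull this reducer's partition
    from every completed map it has not copied from yet."""
    fetched = {}
    for map_id, partitions in completed_maps.items():
        if map_id not in already_fetched:
            fetched[map_id] = partitions[partition]  # over HTTP in real Hadoop
            already_fetched.add(map_id)
    return fetched

# Map outputs are partitioned by reducer; reducer 0 pulls partition 0.
completed = {
    "map1": {0: ["a", 1], 1: ["b", 2]},
    "map2": {0: ["a", 5], 1: ["c", 3]},
}
seen = set()
pulled = reducer_fetch(0, completed, seen)
print(pulled)   # {'map1': ['a', 1], 'map2': ['a', 5]}
# A second poll finds nothing new until more maps complete:
print(reducer_fetch(0, completed, seen))   # {}
```

Because the copying happens map by map as maps finish, the copy phase can overlap with still-running maps, which is the point made earlier in the thread about reducers polling the JobTracker.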
> >
> >
>