MapReduce, mail # user - The location of the map execution


Re: The location of the map execution
Mohit Anchlia 2012-03-04, 19:44
On Sun, Mar 4, 2012 at 4:15 AM, Joey Echeverria <[EMAIL PROTECTED]> wrote:

> I misspoke in my previous e-mail. The default scheduler does do data
> local scheduling, but it's not perfect. When using the default
> scheduler, tasks are assigned to TaskTrackers on every heartbeat.
> When a TaskTracker checks in, the JobTracker will assign any tasks
> that are node-local or rack-local. When you run a job with a single
> map task, it's very likely that a rack-local TaskTracker will become
> available before a node-local one does. This means that for jobs with
> a small task count, you're less likely to get data locality. For jobs
> with a task count close to or greater than the number of TaskTrackers,
> you're much more likely to get node-local assignments.
>

Thanks for the clarification. It helps a lot. I am learning things every
day. In my case the number of input splits is somewhere between 200 and 300.
Does it still make sense to use the FairScheduler? What do people generally use?
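
To make the heartbeat behavior described above concrete, here is a minimal toy simulation. It is only a sketch under loud assumptions, not the JobTracker's actual code: the cluster is a single rack (so every non-local assignment counts as rack-local), each TaskTracker has one map slot, every block has exactly one replica, and all running tasks finish before the next heartbeat round. The function name `simulate_locality` and its parameters are made up for illustration.

```python
import random


def simulate_locality(num_trackers, num_tasks, seed=0):
    """Return the fraction of map tasks assigned node-local in a toy model.

    Assumes num_trackers >= 1 and num_tasks >= 1.
    """
    rng = random.Random(seed)
    # Each task's single input block lives on one randomly chosen tracker node.
    pending = [rng.randrange(num_trackers) for _ in range(num_tasks)]
    node_local = 0
    while pending:
        # One heartbeat round: every tracker checks in once, in random order,
        # with one free map slot (an assumption of this sketch).
        order = list(range(num_trackers))
        rng.shuffle(order)
        for tracker in order:
            if not pending:
                break
            if tracker in pending:
                # Prefer a task whose data sits on this tracker (node-local).
                pending.remove(tracker)
                node_local += 1
            else:
                # Otherwise hand out any pending task; in this single-rack
                # toy cluster that assignment is rack-local.
                pending.pop()
    return node_local / num_tasks


if __name__ == "__main__":
    # A single-task job versus a job with a few hundred splits.
    single = sum(simulate_locality(20, 1, seed=s) for s in range(200)) / 200
    many = simulate_locality(20, 250)
    print("single-task job, average node-local fraction:", single)
    print("250-task job, node-local fraction:", many)
```

In this model a single-task job is node-local only when the tracker holding the block happens to heartbeat first, which has probability 1/num_trackers. A job with a few hundred splits almost always finds a node-local task for whichever tracker checks in, so most of its tasks run where their data lives. Real clusters differ (HDFS normally keeps several replicas per block, and trackers have multiple slots), so treat the numbers as illustrative only.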

>
> -Joey
>
> On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria <[EMAIL PROTECTED]> wrote:
> >>
> >> Sorry, I meant have you set the mapred.jobtracker.taskScheduler
> >> property in your mapred-site.xml file. If not, you're using the
> >> standard, FIFO scheduler. The default scheduler doesn't do data-local
> >> scheduling, but the fair scheduler and capacity scheduler do. You want
> >> to set mapred.jobtracker.taskScheduler to either
> >> org.apache.hadoop.mapred.FairScheduler (for the fair scheduler) or
> >> org.apache.hadoop.mapred.CapacityTaskScheduler (for the capacity
> >> scheduler) and then restart the JobTracker. You can read about the two
> >> schedulers here:
> >>
> >> http://hadoop.apache.org/common/docs/current/fair_scheduler.html
> >> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
> >>
> >
> > I thought that by default tasks are scheduled on the nodes that hold those
> > data blocks. I thought that was inherent. In the fair scheduler link I don't
> > see anything about data-local scheduling.
> >
> >> -Joey
> >>
> >> On Sat, Mar 3, 2012 at 6:32 PM, Hassen Riahi <[EMAIL PROTECTED]> wrote:
> >> > The jobtracker is running on another machine (node C).
> >> >
> >> > Hassen
> >> >
> >> >
> >> >> Which scheduler are you using?
> >> >>
> >> >> -Joey
> >> >>
> >> >> On Mar 3, 2012, at 18:52, Hassen Riahi <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> We tried using MapReduce to run a simple map job that reads a txt
> >> >>> file stored in HDFS and then writes the output.
> >> >>> The file is very small. It was not split, and it was written
> >> >>> entirely to a single datanode (node A). This node is also
> >> >>> configured as a tasktracker node.
> >> >>> While we were expecting the map to execute on node A
> >> >>> (since the input is stored there), the log files show that the map
> >> >>> was executed on another tasktracker (node B) of the cluster.
> >> >>> Am I missing something?
> >> >>>
> >> >>> Thanks for the help!
> >> >>> Hassen
> >> >>>
> >> >
> >>
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >
> >
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>
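
For reference, the scheduler setting discussed in this thread would look roughly like the fragment below in mapred-site.xml on the JobTracker. This is a sketch built only from the property name and class names quoted in the thread. It assumes the chosen scheduler's jar is already available on the JobTracker's classpath, which may need to be arranged separately depending on the distribution.

```xml
<!-- mapred-site.xml on the JobTracker; restart the JobTracker after editing. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <!-- Use org.apache.hadoop.mapred.CapacityTaskScheduler for the capacity scheduler. -->
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

As corrected later in the thread, leaving this property unset does not disable data-local scheduling. The default scheduler also attempts it, just less reliably for jobs with very few tasks.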