MapReduce >> mail # user >> The location of the map execution


Re: The location of the map execution
On Sun, Mar 4, 2012 at 4:15 AM, Joey Echeverria <[EMAIL PROTECTED]> wrote:

> I misspoke in my previous e-mail. The default scheduler does do data
> local scheduling, but it's not perfect. When using the default
> scheduler, tasks are assigned to TaskTrackers on every heartbeat.
> When a TaskTracker checks in, the JobTracker will assign any tasks
> that are node-local or rack-local. When you run a job with a single
> map task, it's very likely that a rack-local TaskTracker will become
> available before a node-local one does. This means that for jobs with
> a small task count, you're less likely to get data locality. For jobs
> with a task count close to or greater than the number of TaskTrackers,
> you're much more likely to get node-local assignments.
>
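
The effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the JobTracker's actual code: the function name simulate and its parameters are hypothetical, all TaskTrackers sit in one rack (so every non-local assignment counts as rack-local), each TaskTracker has one free map slot per heartbeat, and heartbeats arrive in random order.

```python
import random


def simulate(num_trackers, num_tasks, replicas=1, seed=0):
    """Return the fraction of map tasks that were assigned node-local.

    Toy model (assumptions): one rack, one free map slot per heartbeat,
    heartbeats arrive in a random order each round.
    """
    rng = random.Random(seed)
    trackers = list(range(num_trackers))
    # Each task's input split lives on `replicas` randomly chosen nodes.
    pending = {t: set(rng.sample(trackers, replicas)) for t in range(num_tasks)}
    node_local = 0
    while pending:
        rng.shuffle(trackers)  # one round of heartbeats
        for tracker in trackers:
            if not pending:
                break
            # Prefer a task whose split is stored on the tracker checking in.
            local = [t for t, hosts in pending.items() if tracker in hosts]
            if local:
                task = local[0]
                node_local += 1
            else:
                # Otherwise hand out any pending task (rack-local in this model).
                task = next(iter(pending))
            del pending[task]
    return node_local / num_tasks


if __name__ == "__main__":
    for tasks in (1, 20, 300):
        runs = [simulate(20, tasks, seed=s) for s in range(100)]
        print(tasks, "tasks: node-local fraction", sum(runs) / len(runs))
```

In this model a job with a single map task and a single replica is node-local only when the first TaskTracker to check in happens to hold the block, which is a 1-in-num_trackers chance. With many more tasks than TaskTrackers, most heartbeats find a node-local task waiting.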

Thanks for the clarification. It helps a lot. I am learning things every
day. In my case, I have somewhere between 200 and 300 input splits. Does
it still make sense to use the FairScheduler? What do people generally use?

>
> -Joey
>
> On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> > On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria <[EMAIL PROTECTED]> wrote:
> >>
> >> Sorry, I meant have you set the mapred.jobtracker.taskScheduler
> >> property in your mapred-site.xml file. If not, you're using the
> >> standard, FIFO scheduler. The default scheduler doesn't do data-local
> >> scheduling, but the fair scheduler and capacity scheduler do. You want
> >> to set mapred.jobtracker.taskScheduler to either
> >> org.apache.hadoop.mapred.FairScheduler (for the fair scheduler) or
> >> org.apache.hadoop.mapred.CapacityTaskScheduler (for the capacity
> >> scheduler) and then restart the JobTracker. You can read about the two
> >> schedulers here:
> >>
> >> http://hadoop.apache.org/common/docs/current/fair_scheduler.html
> >> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html
> >>
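
For reference, the property named above goes into mapred-site.xml as a standard Hadoop <property> entry. The sketch below only builds that fragment and prints it; the helper name scheduler_property is hypothetical, and the class name is the fair scheduler class quoted in the message. Merging the fragment into an existing mapred-site.xml and restarting the JobTracker are left out.

```python
import xml.etree.ElementTree as ET


def scheduler_property(scheduler_class):
    """Build the <property> element that selects the JobTracker scheduler."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapred.jobtracker.taskScheduler"
    ET.SubElement(prop, "value").text = scheduler_class
    return prop


# Wrap the property in the <configuration> root used by Hadoop site files.
config = ET.Element("configuration")
config.append(scheduler_property("org.apache.hadoop.mapred.FairScheduler"))
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

Passing org.apache.hadoop.mapred.CapacityTaskScheduler instead would select the capacity scheduler.
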
> >
> > I thought by default tasks are scheduled on those nodes that have those
> > data blocks. I thought that was inherent. In the fair scheduler link I
> > don't see anything about data-local scheduling.
> >
> >> -Joey
> >>
> >> On Sat, Mar 3, 2012 at 6:32 PM, Hassen Riahi <[EMAIL PROTECTED]> wrote:
> >> > The jobtracker is running on another machine (node C)
> >> >
> >> > Hassen
> >> >
> >> >
> >> >> Which scheduler are you using?
> >> >>
> >> >> -Joey
> >> >>
> >> >> On Mar 3, 2012, at 18:52, Hassen Riahi <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> We tried using MapReduce to execute a simple map job that reads a
> >> >>> txt file stored in HDFS and then writes the output.
> >> >>> The file to read is very small. It was not split, and it was written
> >> >>> entirely to a single datanode (node A). This node is also configured
> >> >>> as a tasktracker node.
> >> >>> While we were expecting the map to execute on node A (since the
> >> >>> input is stored there), the log files show that the map was
> >> >>> executed on another tasktracker (node B) of the cluster.
> >> >>> Am I missing something?
> >> >>>
> >> >>> Thanks for the help!
> >> >>> Hassen
> >> >>>
> >> >
> >>
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >
> >
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>