The answer depends on a couple of features being present in your
version of Hadoop. Replies are inline.
On Fri, Dec 7, 2012 at 11:38 PM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> Is there a way to force the tasks from a MR job to run ONLY on the
> taskservers where the input split location is?
There is no strictly enforced way to do this, but you can tune your
configuration to make data-local tasks more likely.
> I mean, on the taskdetails UI, I can see all my tasks (25), and some
> of them have Machine == Input split Location. But some don't.
It is sometimes normal to see non-data-local tasks among mostly
data-local tasks in MR - this is due to availability of
slots/resources during job scheduling.
> So I'm wondering if there is a way to force hadoop to run those tasks
> "locally" or else discard them and wait for the local server to be
> able to run them?
You need a scheduler that can delay tasks in favor of locality.
For FairScheduler, in 1.x or so, you can use
mapred.fairscheduler.locality.delay, set in milliseconds in your
mapred-site.xml, to set the maximum time a task waits to be
scheduled with the requested locality. Ideally you'd want to set this
to a period slightly greater than the average time between two
heartbeats from a single tasktracker to the jobtracker. The 2.x
FairScheduler seems to do this automatically.
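As a rough sketch of the FairScheduler setting, a mapred-site.xml entry
could look like the fragment below. The 5000 ms value is only an
assumption for illustration; measure the heartbeat interval on your own
cluster and pick something slightly above it. This also assumes the
FairScheduler is already enabled on the jobtracker.

```xml
<!-- mapred-site.xml (Hadoop 1.x with FairScheduler enabled) -->
<!-- The value below is an illustrative assumption, not a recommendation. -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <!-- Maximum wait, in milliseconds, for a data-local slot -->
  <value>5000</value>
</property>
```

A larger value gives tasks more time to land on a node holding their
input split, at the cost of slots sitting idle a little longer.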
For CapacityScheduler, there isn't any form of delay factor in 1.x
releases. In 2.x however, CapacityScheduler has the
yarn.scheduler.capacity.node-locality-delay config property that can
be set for a similar effect.
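For the 2.x CapacityScheduler, a minimal sketch of the equivalent entry
in capacity-scheduler.xml follows. As far as I understand it, this
value is a count of missed scheduling opportunities rather than a time
in milliseconds, so check the documentation for your release. The 40
used here is an assumption for illustration.

```xml
<!-- capacity-scheduler.xml (Hadoop 2.x CapacityScheduler) -->
<!-- The value below is an illustrative assumption; tune it per cluster. -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <!-- Scheduling opportunities to skip before relaxing node locality -->
  <value>40</value>
</property>
```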
Note that MR avoids absolutely strict scheduling for several
reasons, one being to cope with failure or unavailability of the
target node for an indefinite period. Most users would not want
their tasks to hang forever in such situations, and a few
non-data-local tasks in a job don't hurt the overall execution time
too much.