With HBase, locality is necessarily restricted to one node (the one
hosting the requested region). Otherwise, replication affects locality:
with a replication factor of N, there are N candidate nodes.
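
To make the difference concrete, here is a rough sketch in plain Java.
It is only a model of the idea, not Hadoop or HBase code: the class
SplitLocality, the method isDataLocal and the host names are all made
up for illustration. A task counts as data-local when it runs on any
host that holds a copy of its input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified model (assumption, not Hadoop source): a task is data-local
// when the host running it is one of the hosts that hold its input.
final class SplitLocality {

    // True if taskHost is among the hosts holding the split's data.
    static boolean isDataLocal(Set<String> dataHosts, String taskHost) {
        return dataHosts.contains(taskHost);
    }

    public static void main(String[] args) {
        // HDFS-backed split, replication factor 3: three candidate hosts
        // (hypothetical host names).
        Set<String> hdfsSplit =
            new HashSet<>(Arrays.asList("node1", "node4", "node7"));

        // HBase-backed split: only the node hosting the region.
        Set<String> hbaseSplit =
            new HashSet<>(Arrays.asList("node4"));

        for (String host : Arrays.asList("node1", "node4", "node9")) {
            System.out.println(host
                + " hdfs-local=" + isDataLocal(hdfsSplit, host)
                + " hbase-local=" + isDataLocal(hbaseSplit, host));
        }
    }
}
```

So with plain HDFS input and replication 3, any of the three hosts can
run the task locally; with HBase input there is only one such host.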
On Sat, Dec 8, 2012 at 11:27 PM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> Hi Harsh,
> Thanks for your help.
> mapred.fairscheduler.locality.delay seems to be working very well for
> me. I have set it to 60s and JobInProgress logged only "Choosing
> data-local task"... It seems to do the job for my use case. And as you
> are saying, if I lose a node while the job is running, the task
> will still run after 60 seconds on another node.
> I have not yet looked at CapacityScheduler, but will most probably later.
> One last thing. I have a replication factor set to 3. Does it mean 3
> TaskTrackers might be able to take any of the tasks and run them
> locally? Or only 1?
> 2012/12/8, Harsh J <[EMAIL PROTECTED]>:
>> The answer depends on a couple of features being present in your
>> version of Hadoop; see inline.
>> On Fri, Dec 7, 2012 at 11:38 PM, Jean-Marc Spaggiari
>> <[EMAIL PROTECTED]> wrote:
>>> Is there a way to force the tasks from a MR job to run ONLY on the
>>> TaskTrackers where the input split location is?
>> There is no strictly enforced way to do this, but there are
>> configuration improvements you could make so that conditions are
>> more favorable to data-local tasks.
>>> I mean, on the taskdetails UI, I can see all my tasks (25), and some
>>> of them have Machine == Input split Location. But some don't.
>> It is sometimes normal to see non-data-local tasks among mostly
>> data-local tasks in MR - this is due to availability of
>> slots/resources during job scheduling.
>>> So I'm wondering if there is a way to force hadoop to run those tasks
>>> "locally" or else discard them and wait for the local server to be
>>> able to run them?
>> You need a good scheduler that can address your needs.
>> For FairScheduler, in 1.x or so, you can use
>> mapred.fairscheduler.locality.delay, set in milliseconds in your
>> mapred-site.xml, to set the maximum time to wait for a task to be
>> scheduled with the requested locality. Ideally you'd want to set this
>> to a period slightly greater than the average time between two
>> heartbeats from a single TaskTracker to the JobTracker. The 2.x one
>> seems to do it automatically.
>> For CapacityScheduler, there isn't any form of delay factor in 1.x
>> releases. In 2.x however, CapacityScheduler has the
>> yarn.scheduler.capacity.node-locality-delay config property that can
>> be set for a similar effect.
>> Note that MR does not do absolutely strict scheduling for many
>> reasons, one of them being to cope with failure or unavailability of
>> the target node for an indefinite period. Most users would not want
>> their tasks to hang forever in such situations, and a few
>> non-data-local tasks in a job don't hurt the overall execution time
>> too much.
>> Harsh J
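
For reference, the locality delay described above can be modelled
roughly as follows. This is a simplified sketch in plain Java and not
the FairScheduler source: the class LocalityDelay, the Decision enum
and the decide method are hypothetical, and rack-level locality is
ignored. The localityDelayMs parameter stands in for the value of
mapred.fairscheduler.locality.delay, in milliseconds.

```java
// Simplified model of a locality delay (assumption, not scheduler source):
// prefer a local slot, wait for one up to the configured delay, and only
// then accept a non-local slot so the task cannot hang forever.
final class LocalityDelay {

    enum Decision { RUN_LOCAL, WAIT, RUN_NON_LOCAL }

    // offerIsLocal:    the free slot is on a host holding the task's input
    // waitedMs:        how long the task has been waiting for a local slot
    // localityDelayMs: stand-in for mapred.fairscheduler.locality.delay
    static Decision decide(boolean offerIsLocal, long waitedMs,
                           long localityDelayMs) {
        if (offerIsLocal) {
            return Decision.RUN_LOCAL;
        }
        if (waitedMs < localityDelayMs) {
            return Decision.WAIT;
        }
        return Decision.RUN_NON_LOCAL;
    }

    public static void main(String[] args) {
        long delayMs = 60_000L; // 60 seconds, expressed in milliseconds

        System.out.println(decide(true, 0L, delayMs));
        System.out.println(decide(false, 30_000L, delayMs));
        System.out.println(decide(false, 60_000L, delayMs));
    }
}
```

The last branch is the safety valve: if the preferred node never
frees a slot, for example because it went down, the task still runs
elsewhere once the delay has expired.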