Re: Force task location on input split location?
Jean-Marc Spaggiari 2012-12-09, 01:49
OK, thanks for the clarification. This is for an HBase job, so I will
be restricted to one node.
2012/12/8, Harsh J <[EMAIL PROTECTED]>:
> In the case of HBase, locality is bound to be restricted to one node
> (the node hosting the requested region). Otherwise, the replication
> factor determines locality (N replicas give N candidate nodes).
> On Sat, Dec 8, 2012 at 11:27 PM, Jean-Marc Spaggiari
> <[EMAIL PROTECTED]> wrote:
>> Hi Harsh,
>> Thanks for your help.
>> mapred.fairscheduler.locality.delay seems to be working very well for
>> me. I have set it to 60s and JobInProgress logged only "Choosing
>> data-local task"... It seems to do the job for my use case. And as you
>> said, if I lose a node while the job is running, the task
>> will still run after 60 seconds on another node.
>> I have not yet looked at CapacityScheduler, but will most probably later.
>> One last thing. I have a replication factor set to 3. Does it mean 3
>> TaskTrackers might be able to take any of the tasks and run them
>> locally? Or only 1?
>> 2012/12/8, Harsh J <[EMAIL PROTECTED]>:
>>> Answer depends on a couple of features to be present in your version
>>> of Hadoop, and is inline.
>>> On Fri, Dec 7, 2012 at 11:38 PM, Jean-Marc Spaggiari
>>> <[EMAIL PROTECTED]> wrote:
>>>> Is there a way to force the tasks from a MR job to run ONLY on the
>>>> taskservers where the input split location is?
>>> There is no strictly enforced way to do this, but there are
>>> configuration changes you could make so that conditions are more
>>> favorable to data-local tasks.
>>>> I mean, on the taskdetails UI, I can see all my tasks (25), and some
>>>> of them have Machine == Input split Location. But some don't.
>>> It is sometimes normal to see non-data-local tasks among mostly
>>> data-local tasks in MR - this is due to availability of
>>> slots/resources during job scheduling.
>>>> So I'm wondering if there is a way to force hadoop to run those tasks
>>>> "locally" or else discard them and wait for the local server to be
>>>> able to run them?
>>> You need a good scheduler that can address your needs.
>>> For FairScheduler, in 1.x or so, you can use
>>> mapred.fairscheduler.locality.delay, set in milliseconds in your
>>> mapred-site.xml, to specify the maximum time a task waits to be
>>> scheduled with the requested locality. Ideally you'd want to set this
>>> to a period slightly greater than the average time between two
>>> heartbeats from a single TaskTracker to the JobTracker. The 2.x one
>>> seems to do this automatically.
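>>> As a minimal sketch of the above, assuming a 1.x JobTracker with the
>>> FairScheduler available on its classpath, the mapred-site.xml entries
>>> might look like this (the 60000 ms value is only an illustration and
>>> should be tuned against your heartbeat interval):
>>> ```xml
>>> <!-- Assumption: FairScheduler is selected as the 1.x task scheduler -->
>>> <property>
>>>   <name>mapred.jobtracker.taskScheduler</name>
>>>   <value>org.apache.hadoop.mapred.FairScheduler</value>
>>> </property>
>>> <!-- Maximum wait, in milliseconds, for a data-local slot (example) -->
>>> <property>
>>>   <name>mapred.fairscheduler.locality.delay</name>
>>>   <value>60000</value>
>>> </property>
>>> ```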
>>> For CapacityScheduler, there isn't any form of delay factor in 1.x
>>> releases. In 2.x however, CapacityScheduler has the
>>> yarn.scheduler.capacity.node-locality-delay config property that can
>>> be set for a similar effect.
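>>> A comparable sketch for 2.x, assuming the property is placed in
>>> capacity-scheduler.xml and that, as far as I know, its value counts
>>> missed scheduling opportunities rather than milliseconds (40 here is
>>> just an example value):
>>> ```xml
>>> <!-- Assumption: value is a count of missed scheduling opportunities -->
>>> <property>
>>>   <name>yarn.scheduler.capacity.node-locality-delay</name>
>>>   <value>40</value>
>>> </property>
>>> ```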
>>> Note that MR avoids absolutely strict scheduling for many reasons,
>>> one of them being to cope with failure or unavailability of the
>>> target node for an indefinite period. Most users would not want
>>> their tasks to hang forever in such situations, and a few
>>> non-data-local tasks in the job don't hurt the overall execution
>>> time too much.
>>> Harsh J
> Harsh J