Re: Assigning reduce tasks to specific nodes
Hi Hiroyuki,

I've recently been working on changing the scheduler to improve Hadoop, so I may be able to help you.

RMContainerAllocator#handleEvent decides which MapTasks go to allocated
containers. You can implement a semi-strict (best-effort allocation) mode by
hacking there. Note, however, that the allocation of containers is done
by the ResourceManager. The MRAppMaster cannot control where containers are
allocated, only which MapTasks are assigned to them.
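To illustrate what such a semi-strict (best-effort) mode could look like, here is a small, self-contained Java sketch. This is not Hadoop's actual RMContainerAllocator code; the class and method names are hypothetical. It tries host-local placement first and falls back to any remaining free container:

```java
import java.util.*;

// Hypothetical sketch of "semi-strict" (best-effort) task placement:
// prefer a container on the task's desired host, then fall back to any host.
// Names here (BestEffortAssigner, assign) are illustrative, not Hadoop's APIs.
public class BestEffortAssigner {

    // taskPreferredHost: task id -> host the task would like to run on.
    // freeContainerHosts: hosts of currently free containers (one entry each).
    // Returns task id -> host actually chosen.
    public static Map<String, String> assign(Map<String, String> taskPreferredHost,
                                             List<String> freeContainerHosts) {
        Map<String, String> placement = new LinkedHashMap<>();
        List<String> free = new ArrayList<>(freeContainerHosts);

        // First pass: strict, host-local placements only.
        for (Map.Entry<String, String> e : taskPreferredHost.entrySet()) {
            if (free.remove(e.getValue())) {
                placement.put(e.getKey(), e.getValue());
            }
        }
        // Second pass: best effort, take any remaining container.
        for (String task : taskPreferredHost.keySet()) {
            if (!placement.containsKey(task) && !free.isEmpty()) {
                placement.put(task, free.remove(0));
            }
        }
        return placement;
    }
}
```

The two-pass shape is the point: the first pass enforces locality, and only the second relaxes it, so tasks that could be local never lose their slot to a non-local one.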

If you have any questions, please ask.

Thanks,
Tsuyoshi
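Mohit's "locality delay parameter" suggestion in the quoted thread below refers to the fair scheduler's delay-scheduling knob. As a rough illustration, assuming the Hadoop 1.x fair scheduler is the configured scheduler, the setting would go in mapred-site.xml along these lines:

```xml
<!-- Illustrative fragment only (Hadoop 1.x fair scheduler).
     A large value pushes placement toward "strict" node-locality,
     but stalls jobs that do not actually need locality. -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <value>30000</value> <!-- wait up to 30s for a node-local slot -->
</property>
```

As Harsh notes below, setting this very high is a blunt instrument: it delays every job in the pool, not just the ones that want strict placement.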
On Sat, Dec 8, 2012 at 4:51 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Hi Hiroyuki,
>
> Have you made any progress on that?
>
> I'm also looking at a way to assign specific Map tasks to specific
> nodes (I want the Map to run where the data is).
>
> JM
>
> 2012/12/1, Michael Segel <[EMAIL PROTECTED]>:
> > I haven't thought about reducers, but in terms of mappers you need to
> > override the data locality so that it thinks that the node where you
> > want to send the data exists.
> > Again, not really recommended since it will kill performance unless the
> > compute time is at least an order of magnitude greater than the time it
> > takes to transfer the data.
> >
> > Really, really don't recommend it....
> >
> > We did it as a hack, just to see if we could do it and get better overall
> > performance for a specific job.
> >
> >
> > On Dec 1, 2012, at 6:27 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> >> Yes, scheduling is done on a Tasktracker heartbeat basis, so it is
> >> certainly possible to do absolutely strict scheduling (although be
> >> aware of the condition of failing/unavailable tasktrackers).
> >>
> >> Mohit's suggestion is somewhat like what you desire (delay scheduling
> >> in fair scheduler config) - but setting it to very high values is bad
> >> to do (for jobs that don't need this).
> >>
> >> On Sat, Dec 1, 2012 at 4:11 PM, Hiroyuki Yamada <[EMAIL PROTECTED]>
> >> wrote:
> >>> Thank you all for the comments.
> >>>
> >>>> you ought to make sure your scheduler also does non-strict scheduling
> >>>> of data local tasks for jobs that don't require such strictness
> >>>
> >>> I just want to make sure one thing.
> >>> If I write my own scheduler, is it possible to do "strict" scheduling ?
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Nov 29, 2012 at 1:56 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >>>> Look at locality delay parameter
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>> On Nov 28, 2012, at 8:44 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> None of the current schedulers are "strict" in the sense of "do not
> >>>>> schedule the task if such a tasktracker is not available". That has
> >>>>> never been a requirement for Map/Reduce programs and nor should be.
> >>>>>
> >>>>> I feel if you want some code to run individually on all nodes for
> >>>>> whatever reason, you may as well ssh into each one and start it
> >>>>> manually with appropriate host-based parameters, etc.. and then
> >>>>> aggregate their results.
> >>>>>
> >>>>> Note that even if you get down to writing a scheduler for this (which
> >>>>> I don't think is a good idea, but anyway), you ought to make sure your
> >>>>> scheduler also does non-strict scheduling of data local tasks for jobs
> >>>>> that don't require such strictness - in order for them to complete
> >>>>> quickly rather than wait around for scheduling in a fixed manner.
> >>>>>
> >>>>> On Thu, Nov 29, 2012 at 6:00 AM, Hiroyuki Yamada <[EMAIL PROTECTED]> wrote:
> >>>>>> Thank you all for the comments and advices.
> >>>>>>
> >>>>>> I know it is not recommended to assign mapper locations myself.
> >>>>>> But there needs to be one mapper running on each node in some cases,
> >>>>>> so I need a strict way to do it.
> >>>>>>
> >>>>>> So, locations are taken care of by the JobTracker (scheduler), but it
> >>>>>> is not strict.
> >>>>>> And the only way to do it strictly is writing my own scheduler, right?
> >>>>>>
> >>>>>> I have checked the source and I am not sure where to modify to do