Answers inline. Will address the DistributedShell questions in a follow-up.
On Mar 12, 2013, at 7:31 AM, Ioan Zeng wrote:
> Hello Hadoop Team,
> I need some help/support in evaluating Hadoop YARN from the perspective of:
> - Adding/removing test nodes dynamically without restart
> Is this possible in YARN? or is it only a MapReduce application feature?
Yarn Feature. Each node has a NodeManager daemon running on it. The Resourcemanager tracks nodes' health
via heartbeats from the NodeManager. In that respect, unhealthy nodes will be removed from the available
pool and when an application requests for resources, it will not get allocated to an unhealthy or down node.
> - Failover from crashed test nodes
> Is this automatically done by the YARN framework?
When a node fails, containers running on it are also reported as failed. This feedback about failed nodes and failed containers
is passed to the application by the ResourceManager. How the application handles these failures and acts upon the failure events to
recover is based on application code. Each application will have its own logic to recover - YARN provides all the information for an application
to act upon but does not take any action on the application's behalf ( slight caveat that the yarn framework does support restarting ( up to a max
retry attempt limit ) the application itself if the application died or the node on which the application was running died.
> - Simple prioritization of jobs
> I know there is a prioritization possibility of the tasks in the
> Scheduler, right?
Yes and no - depends on which scheduler you configure the RM with and how you configure it. The default scheduler used is the CapacityScheduler.
Details on that are at http://hadoop.apache.org/docs/stable/capacity_scheduler.html.
> - Insight into queue (simple reporting, including running jobs on nodes)
> Is there any reporting possibility in YARN regarding the status of
> the not executed, running, finished jobs?
The YARN UI provides these details for the last N applications that were handled ( N being around 10000 ). There are also webservices to access this
data in xml/json format. There is a plan for a generic application history server but that is still being worked on.
> - Support heterogenous OSes on nodes (or use separate masters for
> homogenous grids)
> I think this is supported, right?
I don't believe that there is anything stopping support of a heterogenous cluster. However, I am not sure if we have come across anyone
who has tried that out. A lot of the problems may arise based on how well a application is written to handle launching containers correctly on different OS types.
The scheduler today does not account for asking for containers on only certain OS types. I am guessing there might be some minor features that be needed to be
addressed to fully support it. If you do try the above use case, please let us know if there are features/issues that you would like to see addressed.
> Another point I would like to evaluate is the Distributed Shell example usage.
> Our use case is to start different scripts on a grid. Once a node has
> finished a script a new script has to be started on it. A report about
> the scripts execution has to be provided. in case a node has failed to
> execute a script it should be re-executed on a different node. Some
> scripts are Windows specific other are Unix specific and have to be
> executed on a node with a specific OS.
> The question is:
> Would it be feasible to adapt the example "Distributed Shell"
> application to have the above features?
> If yes how could I run some specific scripts only on a specific OS? Is
> this the ResourceManager responsability? What happens if there is no
> Windows node for example in the grid but in the queue there is a
> Windows script?
> How to re-execute failed scripts? Does it have to be implemented by
> custom code, or is it a built in feature of YARN?
> Thank you in advance for your support,