I am creating a prototype Application Master. I am using an old YARN
development version: revision 1437315, from 2013-01-23 (SNAPSHOT 3.0.0). I
cannot update to the current trunk version, as the prototype deadline is
soon and I don't have time to adapt to the YARN API changes.
My cluster setup is as follows:
- each computational node acts as both a NodeManager and a DataNode
- a single dedicated node runs the ResourceManager and the NameNode
To achieve data locality, I have requested Containers/Tasks on the hosts
that hold the input HDFS blocks (new
AMRMClient.ContainerRequest(capability, *blocksHosts*, racks, pri,
numContainers)). I know that this placement is not guaranteed, but let's
assume the Tasks were scheduled directly on the hosts with the input HDFS
blocks. I have 3 questions regarding reading/writing data from HDFS.
1. How can a Container/Task read a local HDFS block?
Since the Container/Task was scheduled on the same computational node as
its input HDFS block, how can I read the local block? Should I use
LocalFileSystem, since the HDFS block is stored locally? Any code snippet
or source code reference will be greatly appreciated.
2. Multiple Containers on the same Host: how do I determine which local
block should be read by which Container/Task?
Suppose multiple Containers/Tasks are scheduled on the same Host, and
several different input HDFS blocks are stored on that Host. How can I
ensure that each Container/Task reads "its" local HDFS block? For example,
the INPUT consists of 10 blocks, the Job uses 5 nodes, 2 Containers were
scheduled on each node, and each node holds 2 distinct HDFS blocks. How
can I read Block_A in Container_2_Host_A and Block_B in Container_3_Host_A?
Again, any code snippet or source code reference will be greatly
appreciated.
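To make question 2 concrete, here is a minimal sketch of the bookkeeping I
imagine on the Application Master side. It is plain Java with no Hadoop
calls; BlockAssignment, Range, blockRanges and assign are hypothetical
names of mine, not part of any API. I assume the AM already knows which
block ranges are stored on which host, and that it can hand each Container
an (offset, length) pair, for example as a launch argument. Whether a Task
that reads this range through the normal HDFS client actually gets a local
read is exactly the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: pairs each Container with one block range on its host.
public class BlockAssignment {

    /** Byte range of one block within the input file. */
    public static final class Range {
        public final long offset;
        public final long length;

        public Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits a file into block-sized ranges; the last one may be shorter. */
    public static List<Range> blockRanges(long fileLength, long blockSize) {
        List<Range> ranges = new ArrayList<>();
        for (long off = 0; off < fileLength; off += blockSize) {
            ranges.add(new Range(off, Math.min(blockSize, fileLength - off)));
        }
        return ranges;
    }

    /**
     * Gives the i-th Container on a host the i-th block range stored on
     * that host. Containers without a matching local block get no entry.
     */
    public static Map<String, Range> assign(
            Map<String, List<Range>> blocksByHost,
            Map<String, List<String>> containersByHost) {
        Map<String, Range> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : containersByHost.entrySet()) {
            List<Range> local = blocksByHost.get(e.getKey());
            if (local == null) {
                continue; // no input block is stored on this host
            }
            List<String> containers = e.getValue();
            for (int i = 0; i < containers.size() && i < local.size(); i++) {
                result.put(containers.get(i), local.get(i));
            }
        }
        return result;
    }
}
```

With two block ranges listed for Host_A and the Containers
Container_2_Host_A and Container_3_Host_A, assign() would map the first
Container to the first range and the second Container to the second one.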
3. Write an HDFS block to the local node (not to the local file system).
How can I write the processed blocks back to HDFS, but have them stored on
the same local host? As far as I know (if I am wrong, please correct me),
whenever a Task writes some data to HDFS, HDFS tries to store it on the
same host, then the same rack, then as close as possible (assuming
replication factor 3). Is this process automated, so that a simple
hdfs.write() will do the trick? As before, any code snippet or source code
reference will be greatly appreciated.
Thank you in advance for your help.