-Re: How can a YarnTask read/write local-host HDFS blocks?
Harsh J 2013-06-22, 11:32
On Sat, Jun 22, 2013 at 4:21 PM, blah blah <[EMAIL PROTECTED]> wrote:
> Hi all
> I am creating a prototype Application Master. I am using old Yarn
> development version. Revision 1437315, from 2013-01-23 (SNAPSHOT 3.0.0). I
> can not update to current trunk version, as prototype deadline is soon, and
> I don't have time to include Yarn API changes.
> My cluster setup is as follows:
> - each computational node acts as NodeManager and as a DataNode
> - dedicated single node for the ResourceManager and NameNode
> I have scheduled Containers/Tasks to the hosts which hold input data HDFS
> blocks to achieve data locality (new AMRMClient.ContainerRequest(capability,
> blocksHosts, racks, pri, numContainers)). I know that the Task schedule is
> not guaranteed (but lets assume Tasks were scheduled directly to hosts with
> input HDFS blocks). I have 3 questions regarding reading/writing data from
> 1. How can a Container/Task read local HDFS block?
> Since Container/Task was scheduled on the same computational node as its
> input HDFS block, how can I read the local block? Should I use
> LocalFileSystem, since HDFS block is stored locally? Any code snippet or
> source code reference will be greatly appreciated.
The HDFS client does local reads automatically if there is a local DN
where they are running (and it has the block they request). A
developer needn't concern themselves with explicitly trying to read
local data - it is done automatically by the framework.
> 2. Multiple Containers on same Host, how to differ which local block should
> be read by which Container/Task?
> In case there are multiple Containers/Tasks scheduled to the same Host, and
> also different input HDFS blocks are stored on the same Host. How can I
> ensure that Container/Task will read "its" HDFS local block. For example
> INPUT consists of 10 blocks, Job uses 5 nodes, and for each node 2
> containers were scheduled, also each node holds 2 distinct HDFS blocks. How
> can I read Block_A in Container_2_Host_A and Block_B in Container_3_Host_A.
> Again any code snippet or source code reference will be greatly appreciated.
You have to basically assign a file (offset + len) to each container
ID you launch. They then have to read this assigned file alone. You
can pass this read info to them via CLI options, some serialized file,
> 3. Write HDFS block to local node (not local file system).
> How can I write read-processed HDFS blocks back to HDFS, but store it on the
> same local host. As far as I know (if I am wrong please correct me),
> whenever Task writes some data to HDFS, HDFS tries to store it on the same
> host, then rack, then as close as possible (assuming replication factor 3).
> Is this process automated, and simple hdfs.write() will do the trick? You
> know that any code snippet or source code reference will be greatly
This process is automatic in the same way a local ready is automatic.
You needn't write special code for this.
> Thank you for your help in advance.