Re: How can a YarnTask read/write local-host HDFS blocks?
Hi,

On Sat, Jun 22, 2013 at 4:21 PM, blah blah <[EMAIL PROTECTED]> wrote:
> Hi all
>
> Disclaimer
> I am creating a prototype Application Master, using an old YARN
> development version: revision 1437315, from 2013-01-23 (3.0.0-SNAPSHOT).
> I cannot update to the current trunk, as the prototype deadline is soon
> and I don't have time to incorporate the YARN API changes.
>
> My cluster setup is as follows:
> - each computational node acts as a NodeManager and a DataNode
> - a single dedicated node runs the ResourceManager and the NameNode
>
> I have scheduled Containers/Tasks to the hosts that hold the input HDFS
> blocks to achieve data locality (new AMRMClient.ContainerRequest(capability,
> blocksHosts, racks, pri, numContainers)). I know that Task placement is
> not guaranteed (but let's assume the Tasks were scheduled directly to the
> hosts holding their input HDFS blocks). I have 3 questions regarding
> reading/writing data from HDFS.
>
> 1. How can a Container/Task read a local HDFS block?
> Since the Container/Task was scheduled on the same computational node as
> its input HDFS block, how can I read that local block? Should I use
> LocalFileSystem, since the HDFS block is stored locally? Any code snippet
> or source code reference will be greatly appreciated.

The HDFS client does local reads automatically if there is a DataNode on
the host where the client is running (and that DN has the block being
requested). A developer needn't concern themselves with explicitly trying
to read local data - it is done automatically by the framework.
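
For illustration, here is a minimal sketch (the class name and input path
are placeholders): inside the container you just open the file with the
normal FileSystem API, and the DFS client serves any block held by the
local DataNode from that node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // HDFS, not LocalFileSystem
    // args[0] is the input file, e.g. hdfs:///data/input/part-00000
    FSDataInputStream in = fs.open(new Path(args[0]));
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      // process buf[0..n); blocks hosted on this node are served by the
      // local DataNode automatically
    }
    in.close();
  }
}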

> 2. With multiple Containers on the same host, how do I control which
> local block is read by which Container/Task?
> If multiple Containers/Tasks are scheduled to the same host, and several
> different input HDFS blocks are stored on that host, how can I ensure
> that each Container/Task reads "its" HDFS block? For example, the input
> consists of 10 blocks, the job uses 5 nodes, 2 containers were scheduled
> per node, and each node holds 2 distinct HDFS blocks. How can I read
> Block_A in Container_2_Host_A and Block_B in Container_3_Host_A?
> Again, any code snippet or source code reference will be greatly
> appreciated.

You basically have to assign a file span (path + offset + length) to each
container you launch; each container then reads only its assigned span.
You can pass this read info to the containers via CLI options, a
serialized file, etc.
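
Roughly, as a hedged sketch (class and path names are placeholders, and
the exact YARN API calls may differ on your older snapshot): the AM puts
the assignment on each container's launch command, and the task seeks to
its offset and reads only its length.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class SplitAssignmentSketch {

  // AM side: hand each container its (path, offset, length) assignment as
  // CLI arguments on the launch command. TaskRunner is a placeholder class.
  static ContainerLaunchContext launchContextFor(String path, long offset, long length) {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java my.app.TaskRunner " + path + " " + offset + " " + length));
    return ctx;
  }

  // Task side: open the file, seek to the assigned offset, and read only
  // the assigned length. Blocks held by the local DataNode are read locally.
  static void readAssignedSpan(String[] args) throws Exception {
    Path path = new Path(args[0]);
    long offset = Long.parseLong(args[1]);
    long remaining = Long.parseLong(args[2]);
    FSDataInputStream in = FileSystem.get(new Configuration()).open(path);
    in.seek(offset);
    byte[] buf = new byte[64 * 1024];
    while (remaining > 0) {
      int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) break;
      remaining -= n;
      // process buf[0..n)
    }
    in.close();
  }
}

Passing the span via CLI arguments keeps the task stateless; a small
serialized file shipped as a local resource works just as well.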

> 3. Writing an HDFS block to the local node (not the local file system).
> How can I write the processed HDFS blocks back to HDFS, but store them on
> the same local host? As far as I know (please correct me if I am wrong),
> whenever a Task writes data to HDFS, HDFS tries to store it on the same
> host, then the same rack, then as close as possible (assuming a
> replication factor of 3). Is this process automated, and will a simple
> hdfs.write() do the trick? You know that any code snippet or source code
> reference will be greatly appreciated.

This process is automatic in the same way a local read is automatic. You
needn't write special code for this.
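
As a minimal sketch (the method and output path are made up), a plain
create()/write() from inside the container is all that is needed; HDFS's
default block placement puts the first replica on the local DataNode when
the writer runs on a DataNode host.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalWriteSketch {
  // Write the task's output back to HDFS. Run on a DataNode host, the first
  // replica lands on that host by the default placement policy; the other
  // replicas go off-node/off-rack per the replication factor.
  static void writeResult(byte[] processedBytes, String outputPath) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(outputPath)); // placeholder path
    try {
      out.write(processedBytes);
    } finally {
      out.close();
    }
  }
}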

> Thank you for your help in advance.
>
> regards
> tmp

--
Harsh J