John Lilley 2013-12-17, 00:21
Re: HDFS short-circuit reads
Both of these methods return the same underlying data type that you're
ultimately interested in.  This is the BlockLocation object, which contains
the hosts that have a replica of the block.  Depending on your usage
pattern, one of these methods might be more convenient than the other.

If your application's input is a single file, then you'll likely find that
getFileBlockLocations is a good fit.  This will give you the BlockLocation
information for that one file, and you won't need to write extra code to
pull it out of the RemoteIterator (which you know is only going to contain
one result anyway).

If your application's input is a whole directory, and you then process all
files within that directory, then you'll likely find listLocatedStatus to
be more convenient.  You'll be able to make a single RPC call to get all of
the BlockLocation information for all files.  (Like you said, one call
instead of many.)
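
A minimal sketch of the two approaches, assuming a Hadoop 2.x client library is on the classpath. The class and method names (BlockLocationLookup, forFile, forDirectory) are made up for illustration; getFileStatus, getFileBlockLocations, and listLocatedStatus are the FileSystem calls discussed above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Hypothetical helper; only the FileSystem calls come from the Hadoop API. */
final class BlockLocationLookup {

    /** Single input file: ask for the locations of every block in that file. */
    static BlockLocation[] forFile(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        return fs.getFileBlockLocations(status, 0, status.getLen());
    }

    /** Whole directory: each LocatedFileStatus already carries its block locations. */
    static List<BlockLocation> forDirectory(FileSystem fs, Path dir) throws IOException {
        List<BlockLocation> all = new ArrayList<>();
        RemoteIterator<LocatedFileStatus> it = fs.listLocatedStatus(dir);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            if (status.isFile()) { // skip subdirectories; the listing is not recursive
                all.addAll(Arrays.asList(status.getBlockLocations()));
            }
        }
        return all;
    }
}
```

Each BlockLocation exposes getHosts(), getOffset(), and getLength(), which is what you need to match a task to a node that holds the data.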

Chris Nauroth
Hortonworks
http://hortonworks.com/

On Tue, Dec 17, 2013 at 6:39 AM, John Lilley <[EMAIL PROTECTED]> wrote:

> Thanks!  I do call FileSystem.getFileBlockLocations() now to map tasks
> to local data blocks; is there any advantage to using listLocatedStatus()
> instead?  I guess one call instead of two…
>
> John
>
> *From:* Chris Nauroth [mailto:[EMAIL PROTECTED]]
> *Sent:* Monday, December 16, 2013 6:07 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: HDFS short-circuit reads
>
>
>
> Hello John,
>
>
>
> Short-circuit reads are not on by default.  The documentation page you
> linked to at hadoop.apache.org contains all of the information you need
> to enable them though.
>
>
>
> As for checking the status of short-circuit reads programmatically, here
> are a few thoughts:
>
>
>
> Your application could check Configuration for the
> dfs.client.read.shortcircuit key.  This will tell you at a high level if
> the feature is enabled.  However, note that the feature needs to be turned
> on in configuration for both the DataNode and the HDFS client process.
> Depending on the details of the deployment, the DataNode and the client
> might be using different configuration files.
>
>
>
> This tells you if the feature is enabled, but it doesn't necessarily tell
> you if you're really going to get short-circuit reads when you open the
> file.  There might not be a local replica for the block, in which case the
> read would fall back to the typical remote read behavior anyway.
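
The reasoning in the last two paragraphs can be sketched as a small check. This is an illustration only: ShortCircuitCheck and its methods are hypothetical, not part of Hadoop, and plain maps stand in for the client and DataNode settings so the sketch runs without a cluster. In a real client the flag would be read with Configuration.getBoolean, and the replica hosts would come from BlockLocation.getHosts().

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper that mirrors the reasoning above; it is not a Hadoop API. */
final class ShortCircuitCheck {

    // Key named in the message above; the feature is off unless it is set to true.
    static final String KEY = "dfs.client.read.shortcircuit";

    /** True only if the flag is present and set to "true" in the given settings. */
    static boolean enabled(Map<String, String> settings) {
        return Boolean.parseBoolean(settings.getOrDefault(KEY, "false"));
    }

    /**
     * A short-circuit read is only plausible when both sides enable the feature
     * and the reading host holds a replica of the block.
     */
    static boolean likely(Map<String, String> clientConf,
                          Map<String, String> dataNodeConf,
                          List<String> replicaHosts,
                          String localHost) {
        return enabled(clientConf)
                && enabled(dataNodeConf)
                && replicaHosts.contains(localHost);
    }
}
```

The host comparison here is an exact string match. Real host names may need normalizing (short name versus fully qualified name) before they can be compared this way.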
>
>
>
> Depending on what your application wants to achieve, you might also be
> interested in looking at the FileSystem.listLocatedStatus API to query
> information about blocks and the corresponding locations of replicas.
> Applications like MapReduce use this information to try to schedule their
> work for optimal locality.  Short-circuit reads then become a further
> optimization on top of the gains already achieved by locality.
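
As a rough illustration of scheduling for locality, the sketch below assigns each block to a worker host that already holds a replica, and falls back to round-robin when no worker does. It is a simplified stand-in, not how MapReduce itself schedules: LocalityPlanner, the block ids, and the host names are all hypothetical, and it ignores load balancing and rack awareness.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical planner; the replica hosts would come from BlockLocation.getHosts(). */
final class LocalityPlanner {

    /**
     * @param replicaHosts block id to the hosts that hold a replica of that block
     * @param workerHosts  hosts on which the application can run tasks
     * @return block id to the host chosen to read it
     */
    static Map<String, String> assign(Map<String, List<String>> replicaHosts,
                                      List<String> workerHosts) {
        if (workerHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one worker host is required");
        }
        Map<String, String> plan = new LinkedHashMap<>();
        int next = 0; // round-robin cursor for blocks with no local worker
        for (Map.Entry<String, List<String>> entry : replicaHosts.entrySet()) {
            String chosen = null;
            for (String host : entry.getValue()) {
                if (workerHosts.contains(host)) { // local read is possible here
                    chosen = host;
                    break;
                }
            }
            if (chosen == null) { // no local replica: accept a remote read
                chosen = workerHosts.get(next % workerHosts.size());
                next++;
            }
            plan.put(entry.getKey(), chosen);
        }
        return plan;
    }
}
```

Only the blocks that land on a host with a replica can benefit from short-circuit reads; the rest use the normal remote read path.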
>
>
>
> Hope this helps,
>
>
>   Chris Nauroth
>
> Hortonworks
>
> http://hortonworks.com/
>
>
>
>
>
> On Mon, Dec 16, 2013 at 4:21 PM, John Lilley <[EMAIL PROTECTED]>
> wrote:
>
> Our YARN application would benefit from maximal bandwidth on HDFS reads.
>
> But I’m unclear on how short-circuit reads are enabled.
>
> Are they on by default?
>
> Can our application check programmatically to see if the short-circuit
> read is enabled?
>
> *Thanks,*
>
> *john*
>
>
>
> RE:
>
>
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
>
> https://issues.apache.org/jira/browse/HDFS-347

CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.