Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS >> mail # dev >> Feature request to provide DFSInputStream subclassing mechanism


Copy link to this message
-
Re: Feature request to provide DFSInputStream subclassing mechanism
Hi Jeff,

Do you need to subclass or could you simply wrap? Generally composition as
opposed to inheritance is a lot safer way of integrating software written
by different parties, since inheritance exposes all the implementation
details which are subject to change.

-Todd

On Wed, Aug 7, 2013 at 10:59 AM, Jeff Dost <[EMAIL PROTECTED]> wrote:

> Hello,
>
> We work in a software development team at the UCSD CMS Tier2 Center.  We
> would like to propose a mechanism to allow one to subclass the
> DFSInputStream in a clean way from an external package.  First I'd like to
> give some motivation on why and then will proceed with the details.
>
> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
> CERN.  There are other T2 centers worldwide that contain mirrors of the
> same data we host.  We are working on an extension to Hadoop that, on
> reading a file, if it is found that there are no available replicas of a
> block, we use an external interface to retrieve this block of the file from
> another data center.  The external interface is necessary because not all
> T2 centers involved in CMS are running a Hadoop cluster as their storage
> backend.
>
> In order to implement this functionality, we need to subclass the
> DFSInputStream and override the read method, so we can catch IOExceptions
> that occur on client reads at the block level.
>
> The basic steps required:
> 1. Invent a new URI scheme for the customized "FileSystem" in
> core-site.xml:
>   <property>
>     <name>fs.foofs.impl</name>
>     <value>my.package.**FooFileSystem</value>
>     <description>My Extended FileSystem for foofs: uris.</description>
>   </property>
>
> 2. Write new classes included in the external package that subclass the
> following:
> FooFileSystem subclasses DistributedFileSystem
> FooFSClient subclasses DFSClient
> FooFSInputStream subclasses DFSInputStream
>
> Now any client commands that explicitly use the foofs:// scheme in paths
> to access the hadoop cluster can open files with a customized InputStream
> that extends functionality of the default hadoop client DFSInputStream.  In
> order to make this happen for our use case, we had to change some access
> modifiers in the DistributedFileSystem, DFSClient, and DFSInputStream
> classes provided by Hadoop.  In addition, we had to comment out the check
> in the namenode code that only allows for URI schemes of the form "hdfs://".
>
> Attached is a patch file we apply to hadoop.  Note that we derived this
> patch by modding the Cloudera release hadoop-2.0.0-cdh4.1.1 which can be
> found at:
> http://archive.cloudera.com/**cdh4/cdh/4/hadoop-2.0.0-cdh4.**1.1.tar.gz<http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz>
>
> We would greatly appreciate any advise on whether or not this approach
> sounds reasonable, and if you would consider accepting these modifications
> into the official Hadoop code base.
>
> Thank you,
> Jeff, Alja & Matevz
> UCSD Physics
>

--
Todd Lipcon
Software Engineer, Cloudera
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB