HDFS >> mail # dev >> Feature request to provide DFSInputStream subclassing mechanism

Jeff Dost 2013-08-07, 17:59
Re: Feature request to provide DFSInputStream subclassing mechanism
Hi Jeff,

Do you need to subclass, or could you simply wrap? In general, composition
is a much safer way than inheritance to integrate software written by
different parties, since inheritance exposes all the implementation
details, which are subject to change.
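The wrapping Todd suggests can be sketched with plain java.io types. This is a minimal sketch of composition (the decorator pattern), not Hadoop's API: `FailoverInputStream` and the fallback supplier are hypothetical names invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

// Hypothetical sketch of composition: the wrapper holds an InputStream
// rather than subclassing a concrete implementation.  On a read failure it
// switches to a fallback source (e.g. a block fetched from another site).
// Names here are illustrative, not part of Hadoop's API.
public class FailoverInputStream extends InputStream {
    private InputStream current;
    private final Supplier<InputStream> fallback;

    public FailoverInputStream(InputStream primary, Supplier<InputStream> fallback) {
        this.current = primary;
        this.fallback = fallback;
    }

    @Override
    public int read() throws IOException {
        try {
            return current.read();
        } catch (IOException e) {
            // No usable replica: retry against the external source.
            current = fallback.get();
            return current.read();
        }
    }
}
```

Because only the public InputStream contract is relied on, the wrapped stream's internals can change between Hadoop releases without breaking the wrapper.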


On Wed, Aug 7, 2013 at 10:59 AM, Jeff Dost <[EMAIL PROTECTED]> wrote:

> Hello,
> We work in a software development team at the UCSD CMS Tier2 Center.  We
> would like to propose a mechanism to allow one to subclass the
> DFSInputStream in a clean way from an external package.  First I'd like to
> give some motivation on why and then will proceed with the details.
> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
> CERN.  There are other T2 centers worldwide that contain mirrors of the
> same data we host.  We are working on an extension to Hadoop so that,
> when a file is read and no replicas of a block are available, an external
> interface is used to retrieve that block from another data center.  The
> external interface is necessary because not all T2 centers involved in
> CMS run a Hadoop cluster as their storage backend.
> In order to implement this functionality, we need to subclass the
> DFSInputStream and override the read method, so we can catch IOExceptions
> that occur on client reads at the block level.
> The basic steps required:
> 1. Invent a new URI scheme for the customized "FileSystem" in
> core-site.xml:
>   <property>
>     <name>fs.foofs.impl</name>
>     <value>my.package.FooFileSystem</value>
>     <description>My Extended FileSystem for foofs: uris.</description>
>   </property>
> 2. Write new classes included in the external package that subclass the
> following:
> FooFileSystem subclasses DistributedFileSystem
> FooFSClient subclasses DFSClient
> FooFSInputStream subclasses DFSInputStream
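The three-class hierarchy in step 2 can be sketched as below. The base classes here are minimal stand-ins defined locally for illustration (the real DistributedFileSystem, DFSClient, and DFSInputStream live in Hadoop and, as the message notes, need relaxed access modifiers), and readFromRemoteSite is a hypothetical hook, not an actual method in the proposal's patch.

```java
import java.io.IOException;
import java.io.InputStream;

// Minimal stand-ins for the Hadoop classes named above; an external package
// could only extend the real ones after the access-modifier changes the
// proposal describes.
class DFSInputStream extends InputStream {
    @Override public int read() throws IOException { return -1; }
}
class DFSClient {
    DFSInputStream open(String path) throws IOException { return new DFSInputStream(); }
}
class DistributedFileSystem {
    protected DFSClient dfs = new DFSClient();
}

// The external package overrides only what it needs.  FooFSInputStream
// catches block-level IOExceptions in read() and falls back to an external
// retrieval hook (hypothetical here).
class FooFSInputStream extends DFSInputStream {
    @Override
    public int read() throws IOException {
        try {
            return super.read();
        } catch (IOException e) {
            return readFromRemoteSite();
        }
    }
    // Hypothetical: fetch the missing block from another T2 site.
    int readFromRemoteSite() { return -1; }
}
class FooFSClient extends DFSClient {
    @Override
    FooFSInputStream open(String path) throws IOException { return new FooFSInputStream(); }
}
class FooFileSystem extends DistributedFileSystem {
    FooFileSystem() { dfs = new FooFSClient(); }
}
```

Each subclass only swaps in the next class down the chain, so a foofs:// path resolved through FooFileSystem ends up reading through FooFSInputStream.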
> Now any client command that explicitly uses the foofs:// scheme in paths
> to access the Hadoop cluster can open files with a customized InputStream
> that extends the functionality of the default Hadoop client DFSInputStream.  In
> order to make this happen for our use case, we had to change some access
> modifiers in the DistributedFileSystem, DFSClient, and DFSInputStream
> classes provided by Hadoop.  In addition, we had to comment out the check
> in the namenode code that only allows for URI schemes of the form "hdfs://".
> Attached is a patch file we apply to hadoop.  Note that we derived this
> patch by modding the Cloudera release hadoop-2.0.0-cdh4.1.1 which can be
> found at:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz
> We would greatly appreciate any advice on whether this approach sounds
> reasonable, and whether you would consider accepting these modifications
> into the official Hadoop code base.
> Thank you,
> Jeff, Alja & Matevz
> UCSD Physics

Todd Lipcon
Software Engineer, Cloudera
Joe Bounour 2013-08-07, 18:11
Andrew Wang 2013-08-07, 18:30
Jeff Dost 2013-08-07, 22:29
Andrew Wang 2013-08-08, 02:47
Matevz Tadel 2013-08-08, 18:52
Colin McCabe 2013-08-08, 19:10
Matevz Tadel 2013-08-08, 21:04
Suresh Srinivas 2013-08-08, 21:17
Steve Loughran 2013-08-08, 20:30
Matevz Tadel 2013-08-09, 04:51
Steve Loughran 2013-08-09, 17:31
Jeff Dost 2013-08-09, 19:52