Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS >> mail # dev >> Feature request to provide DFSInputStream subclassing mechanism

Copy link to this message
Re: Feature request to provide DFSInputStream subclassing mechanism
On 8 August 2013 21:51, Matevz Tadel <[EMAIL PROTECTED]> wrote:

> Hi Steve,
> Thank you very much for the reality check! Some more answers inline ...
> On 8/8/13 1:30 PM, Steve Loughran wrote:
>> On 7 August 2013 10:59, Jeff Dost <[EMAIL PROTECTED]> wrote:
>>  Hello,
>>> We work in a software development team at the UCSD CMS Tier2 Center.  We
>>> would like to propose a mechanism to allow one to subclass the
>>> DFSInputStream in a clean way from an external package.  First I'd like
>>> to
>>> give some motivation on why and then will proceed with the details.
>>> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
>>> CERN.  There are other T2 centers worldwide that contain mirrors of the
>>> same data we host.  We are working on an extension to Hadoop that, on
>>> reading a file, if it is found that there are no available replicas of a
>>> block, we use an external interface to retrieve this block of the file
>>> from
>>> another data center.  The external interface is necessary because not all
>>> T2 centers involved in CMS are running a Hadoop cluster as their storage
>>> backend.
>> You are relying on all these T2 site being HDFS clusters with exactly the
>> same block sizing of a named file, so that you can pick up a copy of a
>> block from elsewhere?
> No, it's not just HDFS. But we have a common access method for all sites,
> XRootd (http://xrootd.slac.stanford.**edu/<http://xrootd.slac.stanford.edu/>),
> and that's what our implementation of input stream falls back to using.

OK: so any block access issue triggers fallback.
> Also, HDFS block sizes are not the same ... it's in the hands of T2 admins.
>  Why not just handle a failing FileNotFoundException on an open() and
>> either
>> relay to another site or escalate to T1/T0/Castor? This would be far
>> easier
>> to implement with a front end wrapper for any FileSystem. What you are
>> proposing seems far more brittle in both code and execution.
> We already do fallback to xrootd on open failures from our application
> framework ... the job gets redirected to xrootd proxy which downloads the
> whole file and serves data as the job asks for it. The changes described by
> Jeff are an optimization for cases when a single block becomes unavailable.
Like other said, there's work on heterogenous storage. Maybe you could make
sure that there is some handling there for block unavailablity events -then
you can hook in that to handle it.
> We have a lot of data that is replicated on several sites and also
> available on tape at Fermilab. Popularity of datasets (a couple 100 TB)
> varies quite a lot and what we would like to achieve is to be able to
> reduce replication down to one for files that not many people care for at
> the moment. This should give us about 1 PB of diskspace on every T2 center
> and allow us to be more proactive with data movement by simply observing
> the job queues.
>  In order to implement this functionality, we need to subclass the
>>> DFSInputStream and override the read method, so we can catch IOExceptions
>>> that occur on client reads at the block level.
>> Specific IOEs related to missing blocks, or any IOE -as that could swallow
>> other problems?
> We were looking at that but couldn't quite figure out what else can go
> wrong ... so the plan was to log all the exceptions and see what we can do
> about each of them.

network problems, security. If there isn't a specific IOE subclass for
block-not-found, there ought to be.
> Thanks also for the comments below, Jeff will appreciate them more as he
> actually went through all the HDFS code and did all the changes in the
> patch, I did the JNI + Xrootd part.
> One more question ... if we'd converge on an acceptable solution (I find
> it a bit hard at the moment), how long would it take for the changes to be
> released and what release cycle would it end up in?

all changes go into trunk, if it was go to into the 2.x line then it would
be 2.3 at the earliest; 2.1 is going through its beta right now. If you
work against 2.1 and report bugs now, that would help the beta and make it
easier for you to have a private fork of 2.1.x with the extensions