Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS >> mail # dev >> Feature request to provide DFSInputStream subclassing mechanism


Copy link to this message
-
Re: Feature request to provide DFSInputStream subclassing mechanism
On 8 August 2013 21:51, Matevz Tadel <[EMAIL PROTECTED]> wrote:

> Hi Steve,
>
> Thank you very much for the reality check! Some more answers inline ...
>
>
> On 8/8/13 1:30 PM, Steve Loughran wrote:
>
>> On 7 August 2013 10:59, Jeff Dost <[EMAIL PROTECTED]> wrote:
>>
>>  Hello,
>>>
>>> We work in a software development team at the UCSD CMS Tier2 Center.  We
>>> would like to propose a mechanism to allow one to subclass the
>>> DFSInputStream in a clean way from an external package.  First I'd like
>>> to
>>> give some motivation on why and then will proceed with the details.
>>>
>>> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
>>> CERN.  There are other T2 centers worldwide that contain mirrors of the
>>> same data we host.  We are working on an extension to Hadoop that, on
>>> reading a file, if it is found that there are no available replicas of a
>>> block, we use an external interface to retrieve this block of the file
>>> from
>>> another data center.  The external interface is necessary because not all
>>> T2 centers involved in CMS are running a Hadoop cluster as their storage
>>> backend.
>>>
>>>
>> You are relying on all these T2 site being HDFS clusters with exactly the
>> same block sizing of a named file, so that you can pick up a copy of a
>> block from elsewhere?
>>
>
> No, it's not just HDFS. But we have a common access method for all sites,
> XRootd (http://xrootd.slac.stanford.**edu/<http://xrootd.slac.stanford.edu/>),
> and that's what our implementation of input stream falls back to using.
>

OK: so any block access issue triggers fallback.
>
> Also, HDFS block sizes are not the same ... it's in the hands of T2 admins.
>
>
>  Why not just handle a failing FileNotFoundException on an open() and
>> either
>> relay to another site or escalate to T1/T0/Castor? This would be far
>> easier
>> to implement with a front end wrapper for any FileSystem. What you are
>> proposing seems far more brittle in both code and execution.
>>
>
> We already do fallback to xrootd on open failures from our application
> framework ... the job gets redirected to xrootd proxy which downloads the
> whole file and serves data as the job asks for it. The changes described by
> Jeff are an optimization for cases when a single block becomes unavailable.
>
>
Like other said, there's work on heterogenous storage. Maybe you could make
sure that there is some handling there for block unavailablity events -then
you can hook in that to handle it.
> We have a lot of data that is replicated on several sites and also
> available on tape at Fermilab. Popularity of datasets (a couple 100 TB)
> varies quite a lot and what we would like to achieve is to be able to
> reduce replication down to one for files that not many people care for at
> the moment. This should give us about 1 PB of diskspace on every T2 center
> and allow us to be more proactive with data movement by simply observing
> the job queues.
>
>
>  In order to implement this functionality, we need to subclass the
>>> DFSInputStream and override the read method, so we can catch IOExceptions
>>> that occur on client reads at the block level.
>>>
>>>
>> Specific IOEs related to missing blocks, or any IOE -as that could swallow
>> other problems?
>>
>
> We were looking at that but couldn't quite figure out what else can go
> wrong ... so the plan was to log all the exceptions and see what we can do
> about each of them.
>

network problems, security. If there isn't a specific IOE subclass for
block-not-found, there ought to be.
>
> Thanks also for the comments below, Jeff will appreciate them more as he
> actually went through all the HDFS code and did all the changes in the
> patch, I did the JNI + Xrootd part.
>
> One more question ... if we'd converge on an acceptable solution (I find
> it a bit hard at the moment), how long would it take for the changes to be
> released and what release cycle would it end up in?
>

all changes go into trunk, if it was go to into the 2.x line then it would
be 2.3 at the earliest; 2.1 is going through its beta right now. If you
work against 2.1 and report bugs now, that would help the beta and make it
easier for you to have a private fork of 2.1.x with the extensions