Re: Compatibility in Apache Hadoop
Andrew Purtell 2013-04-23, 18:32
At the risk of hijacking this conversation a bit, what do you think of the
notion of moving interfaces like Seekable and PositionedReadable into a new
foundational Maven module, perhaps just for such interfaces that define and
tag support for core semantics, as their details are better defined and
documented? I was involved in a discussion today considering factoring out
the codecs so other ecosystem projects might pull in only codec code.
Similar to how hadoop-auth is slender and has a useful servlet filter
implementing SPNEGO authentication, and so it is pulled into various
places, and can even be used with Hadoop 1. The only thing preventing a
clean separation of codecs like this is imports of Seekable and
PositionedReadable. But these define behavior, they don't implement it.
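To make the "define behavior, don't implement it" point concrete, here is a minimal sketch of the split. The names SketchSeekable, SketchPositionedReadable, ByteArraySource and HeaderPeeker are hypothetical stand-ins invented for illustration; Hadoop's real Seekable and PositionedReadable declare more methods than shown here. The idea is only that codec-side code compiles against the interfaces and nothing else.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified, hypothetical stand-ins modeled on the interfaces named above.
// These would be the only contents of the proposed foundational module.
interface SketchSeekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}

interface SketchPositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
}

/** In-memory implementation; it would live outside the interface module. */
class ByteArraySource implements SketchSeekable, SketchPositionedReadable {
    private final byte[] data;
    private long pos;

    ByteArraySource(byte[] data) { this.data = data; }

    public void seek(long target) throws IOException {
        if (target < 0 || target > data.length) {
            throw new IOException("bad offset " + target);
        }
        this.pos = target;
    }

    public long getPos() { return pos; }

    public int read(long position, byte[] buffer, int offset, int length) {
        if (position < 0) throw new IllegalArgumentException("negative position");
        if (position >= data.length) return -1;
        int n = (int) Math.min(length, data.length - position);
        System.arraycopy(data, (int) position, buffer, offset, n);
        return n; // a positioned read leaves the stream position untouched
    }
}

/** Codec-side helper that depends on the interface only. */
class HeaderPeeker {
    /** Returns the first byte as an unsigned value, or -1 if the source is empty. */
    static int peekFirstByte(SketchPositionedReadable in) {
        byte[] one = new byte[1];
        try {
            return in.read(0, one, 0, 1) == 1 ? (one[0] & 0xff) : -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

HeaderPeeker never names a concrete stream class, which is the property that would let a codec module avoid pulling in the rest of the filesystem code.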
On Tue, Apr 23, 2013 at 9:00 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> On 22 April 2013 18:32, Eli Collins <[EMAIL PROTECTED]> wrote:
> > On Mon, Apr 22, 2013 at 5:42 PM, Steve Loughran <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > There's a separate issue that says "we make some guarantee that the
> > > behaviour of an interface remains consistent over versions", which is
> > > hard to do without some rigorous definition of what the expected
> > > behaviour of an implementation should be.
> > Good point, Steve. I've assumed the semantics of the API had to
> > respect the attribute (eg changing the semantics of FileSystem#close
> > would be an incompatible change, since this is a public/stable API,
> > even if the new semantics are arguably better). But you're right,
> > unless we've actually defined what the semantics of the APIs are it's
> > hard to say if we've materially changed them. How about adding a new
> > section on the page and calling that out explicitly?
> Maybe we should list which bits we consider both well specified and covered
> with tests that verify the implementations in our svn match that
> > In practice I think we'll have to take semantics case by case, clearly
> > define the semantics we care about better in the javadocs (for the
> > major end user-facing classes at least, calling out both intended
> > behavior and behavior that's meant to be undefined) and using
> > individual judgement elsewhere. For example, HDFS-4156 changed
> > DFSInputStream#seek to throw an IOE if you seek to a negative offset,
> > instead of succeeding then resulting in an NPE on the next access.
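The negative-seek behaviour described above is the kind of semantic that could be pinned down by a small contract check run against every implementation. The following is only a sketch under assumptions: SketchSeekableStream and SeekContractCheck are hypothetical names, not Hadoop classes, and the in-memory stream merely stands in for a real filesystem stream.

```java
import java.io.IOException;

/** Hypothetical minimal seekable stream; not Hadoop's actual class. */
class SketchSeekableStream {
    private final byte[] data;
    private long pos = 0;

    SketchSeekableStream(byte[] data) { this.data = data; }

    /** Rejects negative offsets up front instead of failing on a later access. */
    void seek(long target) throws IOException {
        if (target < 0) {
            throw new IOException("Cannot seek to negative offset " + target);
        }
        pos = target;
    }

    long getPos() { return pos; }

    /** Returns the next byte as an unsigned value, or -1 at or past the end. */
    int read() {
        if (pos >= data.length) return -1;
        return data[(int) pos++] & 0xff;
    }
}

class SeekContractCheck {
    /** True if seek(-1) raises IOException and leaves the position unchanged. */
    static boolean negativeSeekRejected(SketchSeekableStream in) {
        long before = in.getPos();
        try {
            in.seek(-1);
            return false; // the seek silently succeeded: contract violated
        } catch (IOException expected) {
            return in.getPos() == before;
        }
    }
}
```

A check like this also documents the part of the behaviour callers may rely on (an IOException, position preserved) separately from what is left undefined.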
> I'd seen that the DFS seek was the best implementation, but hadn't seen the
> cause. The other ones (especially the Buffered one that goes in front of
> most others) are much weaker
> > That's an incompatible change in terms of semantics, but not semantics
> > intended by the author, or likely semantics programs depend on.
> That's a key problem: what do people depend on? A lot of the junit tests
> depended on ordering of methods, after all
> > However if a change made FileSystem#close three times slower, this
> > is perhaps a smaller semantic change (eg doesn't change what exceptions
> > get thrown) but probably much less tolerable for end users.
> You know that the blobstores all buffer their data so that
> 1. flush() is a no-op
> 2. the write takes place on close()
> #1 changes durability expectations, while #2 means the time to close() is
> O(data)*O(latency); P(fail) scales with time and distance, and as lots of
> code swallows exceptions on close, those failures may even go unnoticed.
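The two blobstore behaviours listed above can be illustrated with a small sketch. BufferingBlobOutputStream and BlobCloseDemo are hypothetical names, and the in-memory list is an assumption standing in for the remote store; no real blobstore client is modeled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical blobstore-style stream: buffers locally, uploads only on close(). */
class BufferingBlobOutputStream extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final List<byte[]> store; // stands in for the remote store
    private boolean closed = false;

    BufferingBlobOutputStream(List<byte[]> store) { this.store = store; }

    @Override
    public void write(int b) throws IOException {
        if (closed) throw new IOException("stream closed");
        buffer.write(b);
    }

    @Override
    public void flush() {
        // No-op: nothing reaches the store, so flush() gives no durability.
    }

    @Override
    public void close() throws IOException {
        if (closed) return;
        closed = true;
        // The whole upload happens here, so its cost grows with the data size
        // and any upload failure would surface from close().
        store.add(buffer.toByteArray());
    }
}

class BlobCloseDemo {
    /** Returns {objects visible after flush(), objects visible after close()}. */
    static int[] visibleCounts(byte[] payload) {
        List<byte[]> store = new ArrayList<>();
        try {
            BufferingBlobOutputStream out = new BufferingBlobOutputStream(store);
            out.write(payload);
            out.flush();
            int afterFlush = store.size();
            out.close();
            return new int[] {afterFlush, store.size()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Code that catches and discards the exception from close() on such a stream would lose the only signal that the write failed.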
> then there's the assumption that rename is atomic, which MapReduce depends on
> > In any case, even if we get an 80% solution to the semantics issue
> > we'll probably be in good shape for v2 GA if we can sort out the
> > remaining topics. See any other topics missing? Once the overall
> > outline is in shape it makes sense to annotate the page with the
> > current policy (if there's already consensus on one), and identifying
> > areas where we need to come up with a policy or are leaving TBD.
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)