Re: Compatibility in Apache Hadoop
On 22 April 2013 14:00, Karthik Kambatla <[EMAIL PROTECTED]> wrote:

> Hadoop devs,
> This doc does not intend to propose new policies. The idea is to have one
> document that outlines the various compatibility concerns (lots of areas
> beyond API compatibility), captures the respective policies that exist, and
> if we want to define policies for the items where it’s not clear we have
> something to iterate on.
> The first draft just lists the types of compatibility. In the next step, we
> can add existing policies and subsequently work towards policies for
> others.
I don't see -yet- a definition of "compatible" at the API signature level
vs the semantics level.

The @InterfaceAudience/@InterfaceStability attributes say "these methods
are internal/external/stable/unstable" (there's also @VisibleForTesting,
which comes out of Guava, yes?).

There's a separate issue that says "we make some guarantee that the
behaviour of an interface remains consistent over versions", which is hard
to do without some rigorous definition of what the expected behaviour of an
implementation should be. Even then, there are epiphenomena whose behaviour
isn't a direct part of the specification but which people ended up relying
on. As an example, look how many JUnit tests broke in Java7 after
the order in which methods are enumerated changed. The new ordering
fulfils the Java language specification -"no order is guaranteed"- but the
pre-Java7 implementation offered "same order as you wrote them in the
file". In Hadoop, there's a lot of code that assumes that close() of an
output stream happens within a short period of time, and if you change that
some things break, but nowhere does {{OutputStream.close()}} ever say
"completes within 30s".

Interface semantics are far more important than simple signatures, and they
are the hardest to guarantee -especially when we don't have rigorous enough
definitions of what implementations should really do.

As an example, what does the method Seekable.seek() do?

Its header says
   * Seek to the given offset from the start of the file.
   * The next read() will be from that location.  Can't
   * seek past the end of the file.
  void seek(long pos) throws IOException;

Now, without looking at the source, what do you think the reference
implementation -BufferedFSInputStream- does on any of these sequences:

seek(-1); read()

seek(0); seek(0)

close(); seek(0)
The javadocs don't specify what happens, the implementations don't behave
in ways you'd expect, and that leaves a problem: do you now define the
semantics of seek() more rigorously -and fix our implementations- or do you
conclude that some people may expect the current behaviour on negative
seeks (they are ignored, for the curious), and so we can't actually change
it?
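If we did decide to tighten the contract, the stricter semantics could be written down and enforced directly. Here's a sketch of what that might look like, using invented names (StrictSeekable and InMemorySeekable are illustrative, neither is a Hadoop class), where negative seeks, seeks past EOF, and seeks after close() all fail loudly instead of being silently ignored:

```java
import java.io.Closeable;
import java.io.EOFException;
import java.io.IOException;

// A hypothetical, rigorously-specified seek contract:
//  - pos must be in [0, length], else EOFException
//  - seek on a closed stream is an IOException
//  - seek(n); seek(n) is explicitly idempotent
interface StrictSeekable extends Closeable {
    void seek(long pos) throws IOException;
    int read() throws IOException;
}

class InMemorySeekable implements StrictSeekable {
    private final byte[] data;
    private int pos;
    private boolean closed;

    InMemorySeekable(byte[] data) {
        this.data = data;
    }

    @Override
    public void seek(long newPos) throws IOException {
        if (closed) {
            throw new IOException("seek() on a closed stream");
        }
        if (newPos < 0) {
            throw new EOFException("negative seek: " + newPos);
        }
        if (newPos > data.length) {
            throw new EOFException("seek past end of file: " + newPos);
        }
        this.pos = (int) newPos;
    }

    @Override
    public int read() throws IOException {
        if (closed) {
            throw new IOException("read() on a closed stream");
        }
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

Once the contract is this explicit, the ambiguous sequences above -seek(-1); read(), seek(0); seek(0), close(); seek(0)- each have one defined outcome, and a shared contract test could be run against every FileSystem implementation.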