It's hard to answer without more concrete criteria. Is this a
trademark question affecting the marketing of a product? A
cross-compatibility taxonomy for users? The minimum criteria to
publish a paper/release a product without eye-rolling? The particular
compatibility claims made by a system will be nuanced and specific; a
runtime that executes MapReduce jobs as they would run in Hadoop can
simply make that claim, whether it uses parts of MapReduce, HDFS, or
For the various distributions "Powered by Apache Hadoop," one would
assume that compatibility will vary depending on the feature set and
the audience. A distribution that runs MapReduce applications
as-written for Apache Hadoop may be incompatible with a user's
deployed metrics/monitoring system. Some random script to scrape the
UI may not work. The product may only scale to 20 nodes. Whether these
are "compatible with Apache Hadoop" is awkward to answer generally,
unless we want to define the semantics of that phrase by policy.

To put it bluntly, why would we bother to define such a policy? One
could assert that a fully-compatible system would implement all the
public/stable APIs as defined in HADOOP-5073, but who would that help?
And though interoperability is certainly relevant to systems built on
top of Hadoop, is there a reason the Apache project needs to be
involved in defining the standards for compatibility among them?
Compatibility matters, but I'm not clear on the objective of this discussion.

-C

On Mon, Jan 31, 2011 at 5:18 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> what does it mean to be compatible with Hadoop? And how do products that
> consider themselves compatible with Hadoop say it?
> We have plugin schedulers and the like, and all is well, and the Apache
> brand people keep an eye on distributions of the Hadoop code and make sure
> that Apache Hadoop is cleanly distinguished from redistributions of binaries
> by third parties.
> But then you get distributions, and you have to define what is meant in
> terms of functionality and compatibility.
> Presumably, everyone who issues their own release has either explicitly or
> implicitly done a lot more testing than is in the unit test suite, testing
> that exists to stress test the code on large clusters -is there stuff there
> that needs to be added to SVN to help say a build is of sufficient quality
> to be released?
> Then there are the questions about
> -things that work with specific versions/releases of Hadoop?
> -replacement filesystems?
> -replacement of core parts of the system, like the MapReduce Engine?
> IBM have been talking about "Hadoop on GPFS"
> If this is running the MR layer, should it say "Apache Hadoop MR engine on
> top of IBM GPFS", or what -and how do you define or assess compatibility at
> this point? Is it up to the vendor to say "works with Apache Hadoop", and is
> running the Terasort client code sufficient to say "compatible"?
> Similarly, if the MapReduce engine gets swapped out, what then? We in HP
> Labs have been funding some exploratory work at universities in Berlin on an
> engine that does more operations than just map and reduce, but it will also
> handle the existing operations with API compatibility on the worker nodes.
> The goal here is research with an OSS deliverable, but while it may support
> Hadoop jobs, it's not Hadoop.
> What to call such things?