Re: [VOTE] Shall we adopt the "Defining Hadoop" page
I agree with this.

We need to find a middle ground that achieves three aims:

1) Makes it clear that an ASF release of Hadoop is THE APACHE HADOOP.  Jeff's manpower argument actually reinforces this.  We need a very testable definition of what an Apache Hadoop release is, or enforcement will be impossible because each test of the policy might require a visit to the Supreme Court.  "Its MD5 matches the MD5 of an Apache release" is a clear definition.

2) We need a proposal for naming derived products that vendors feel is branding-friendly.  The names should be clear enough that users understand the difference between a product that packages Apache Hadoop (MD5 test), one that is completely open source under the Apache license (easy to test), and one that simply uses some subset of the code under a more restrictive license or as closed source.

3) Compatibility: I think it would be great to harness all this energy around compatibility to start a compatibility suite inside the Apache Hadoop project.  Then we could define "compatible with Apache Hadoop" in a clear way controlled by the Apache Hadoop PMC.  With luck, vendors on both sides of the debate will be incentivized to contribute to the project this way.  Such a suite would also prove useful to the developers of Apache Hadoop.
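
To make the MD5 test in points 1 and 2 concrete, here is a minimal sketch in Python.  It assumes the published MD5 digest of an official release artifact is available as a hex string; the function names and the file name in the usage note are hypothetical, not part of any Apache tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large release tarballs need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """True if the artifact's MD5 equals the published digest (hypothetical helper).

    `published_md5` is assumed to be the hex digest distributed with the
    official release; whitespace and letter case are normalized before comparing.
    """
    return md5_of_file(path) == published_md5.strip().lower()
```

A vendor package would pass the test only if something like matches_published_md5("hadoop-release.tar.gz", published_digest) returns True for the artifact it ships.  MD5 is used here because it is the digest named above; any stronger hash from hashlib could be substituted the same way.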


On Jun 20, 2011, at 10:09 AM, Ted Dunning wrote:

> Great summary, Andrew.
> I would add one more precipitating factor here.  That is the arrival of a
> number of products which are very close to the Apache version of Hadoop but
> for which there is no good and widely accepted terminology that gives proper
> credit to their lineage while making clear the distinction from bit-for-bit
> copies of official Apache releases.
> Some products are analogous to Hive, Pig or HBase in that they are
> independent systems that run ON Hadoop (or close equivalents).  These have
> no terminology problem because these products aren't Hadoop, but rather use
> Hadoop.
> Other products contain Hadoop internally as a critical component but do not
> necessarily expose Hadoop capabilities to the end user (I can't name these
> products, but they exist).  These products have little nomenclatural
> difficulty because the powered-by-Hadoop description fits very well.
> The products with the terminology problem are the ones that add either
> curation and packaging (Cloudera) or substantial additional performance
> enhancing components (MapR).  These products are upwardly compatible with
> Apache Hadoop in that programs that run on Hadoop will very probably run on
> these Hadoop-like systems.  The problem is that there is no good term for
> these products.  They may even contain components that are bit-for-bit
> identical to the same components for Apache releases.  It is fair to say
> that these are not Apache released software, but it is also fair to say that
> there ought to be a better name for the class of these products.
> On Mon, Jun 20, 2011 at 4:39 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
>> Hadoop I think needs to be more careful. What triggered this discussion is
>> the arrival of new players releasing products they call Hadoop but
>> containing severe changes the community, by way of the ASF umbrella we all
>> work under, had nothing to do with designing or developing. And some of
>> these are being open sourced as a Hadoop. There is no Linus here. Which of
>> these is _the_ Hadoop? As a would-be contributor, which should I select?