Steve Loughran
2011-05-10, 10:29
Andrew Purtell
2011-05-11, 07:43
Steve Loughran
2011-05-11, 10:34
Eric Baldeschwieler
2011-05-11, 21:24
Milind Bhandarkar
2011-05-11, 21:46
M. C. Srivas
2011-05-12, 02:26
Ted Dunning
2011-05-12, 04:37
Steve Loughran
2011-05-12, 09:32
Segel, Mike
2011-05-12, 09:49
Eric Baldeschwieler
2011-05-13, 05:05
Milind Bhandarkar
2011-05-12, 16:45
Konstantin Boudnik
2011-05-12, 22:30
Milind Bhandarkar
2011-05-13, 03:40
Konstantin Boudnik
2011-05-13, 06:24
Milind Bhandarkar
2011-05-13, 07:11
Konstantin Boudnik
2011-05-13, 17:47
Ian Holsman
2011-05-11, 22:42
Jacob R Rideout
2011-05-11, 22:56
Aaron Kimball
2011-05-11, 23:20
Steve Loughran
2011-05-12, 09:33
Konstantin Boudnik
2011-05-12, 22:26
Milind Bhandarkar
2011-05-13, 03:37
Ted Dunning
2011-05-13, 04:05
Milind Bhandarkar
2011-05-13, 04:52
Ted Dunning
2011-05-13, 05:38
Konstantin Boudnik
2011-05-13, 06:12
Milind Bhandarkar
2011-05-13, 06:57
Eric Baldeschwieler
2011-05-16, 05:34
Steve Loughran
2011-05-16, 10:50
Steve Loughran
2011-05-12, 09:23
Allen Wittenauer
2011-05-12, 16:45
Doug Cutting
2011-05-13, 06:16
Milind Bhandarkar
2011-05-13, 07:24
Doug Cutting
2011-05-13, 08:53
Ted Dunning
2011-05-13, 13:43
Doug Cutting
2011-05-13, 14:50
Nathan Roberts
2011-05-13, 15:19
Allen Wittenauer
2011-05-13, 17:28
Segel, Mike
2011-05-13, 17:32
Doug Cutting
2011-05-13, 21:55
Allen Wittenauer
2011-05-13, 22:13
Doug Cutting
2011-05-13, 22:16
Allen Wittenauer
2011-05-13, 22:17
Doug Cutting
2011-05-13, 22:22
Steve Loughran
2011-05-16, 11:15
Eli Collins
2011-05-13, 22:18
Ted Dunning
2011-05-13, 22:53
Allen Wittenauer
2011-05-13, 22:57
Steve Loughran
2011-05-16, 11:01
Segel, Mike
2011-05-16, 12:00
Steve Loughran
2011-05-16, 14:11
Allen Wittenauer
2011-05-16, 17:19
Eli Collins
2011-05-16, 21:09
Allen Wittenauer
2011-05-16, 21:25
Eli Collins
2011-05-16, 21:29
Allen Wittenauer
2011-05-16, 21:42
Ian Holsman
2011-05-16, 21:59
Konstantin Boudnik
2011-05-17, 01:52
Matthew Foley
2011-05-16, 21:17
Segel, Mike
2011-05-17, 00:40
Scott Carey
2011-05-17, 01:12
Segel, Mike
2011-05-17, 01:50
Eric Baldeschwieler
2011-05-17, 02:32
Andrew Purtell
2011-05-17, 02:52
Matthew Foley
2011-05-17, 09:19
Segel, Mike
2011-05-17, 12:52
Doug Cutting
2011-05-17, 13:24
Matthew Foley
2011-05-17, 17:53
Doug Cutting
2011-05-18, 13:20
Roy T. Fielding
2011-05-13, 22:26
Eric Baldeschwieler
2011-05-16, 05:34
Steve Loughran
2011-05-16, 11:20
Sanjay Radia
2011-05-23, 16:27
Steve Loughran
2011-05-24, 16:23
Owen O'Malley
2011-05-31, 22:08
Defining Hadoop Compatibility -revisiting-Steve Loughran 2011-05-10, 10:29
Back in Jan 2011, I started a discussion about how to define Apache Hadoop Compatibility:
http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E

I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet

http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf

It claims that their implementations are 100% compatible, even though the Enterprise edition uses a C filesystem. It also claims that both their software releases contain "Certified Stacks", without defining what Certified means, or who does the certification -only that it is an improvement.

I think we should revisit this issue before people with their own agendas define what compatibility with Apache Hadoop is for us

Licensing
-Use of the Hadoop codebase must follow the Apache License
 http://www.apache.org/licenses/LICENSE-2.0
-plug in components that are dynamically linked to (Filesystems and schedulers) don't appear to be derivative works on my reading of this,

Naming
-this is something for branding@apache, they will have their opinions. The key one is that the name "Apache Hadoop" must get used, and it's important to make clear it is a derivative work.
-I don't think you can claim to have a Distribution/Fork/Version of Apache Hadoop if you swap out big chunks of it for alternate filesystems, MR engines, etc. Some description of this is needed
 "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"

Compatibility
-the definition of the Hadoop interfaces and classes is the Apache Source tree,
-the definition of semantics of the Hadoop interfaces and classes is the Apache Source tree, including the test classes.
-the verification that the actual semantics of an Apache Hadoop release is compatible with the expected semantics is that current and future tests pass
-bug reports can highlight incompatibility with expectations of community users, and once incorporated into tests form part of the compatibility testing
-vendors can claim and even certify their derivative works as compatible with other versions of their derivative works, but cannot claim compatibility with Apache Hadoop unless their code passes the tests and is consistent with the bug reports marked as ("by design"). Perhaps we should have tests that verify each of these "by design" bugreps to make them more formal.

Certification
-I have no idea what this means in EMC's case, they just say "Certified"
-As we don't do any certification ourselves, it would seem impossible for us to certify that any derivative work is compatible.
-It may be best to state that nobody can certify their derivative as "compatible with Apache Hadoop" unless it passes all current test suites
-And require that anyone who declares compatibility define what they mean by this

This is a good argument for getting more functional tests out there -whoever has more functional tests needs to get them into a test module that can be used to test real deployments.
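The suggestion above, that every bug report closed as "by design" should get a test that pins the behaviour down, could be tracked with something as small as the sketch below. It is only an illustration: the report IDs, the test names and the `uncovered_reports` helper are all made up for this example and do not refer to real issues or to anything in the Hadoop source tree.

```python
from typing import Dict, List

# Hypothetical IDs of bug reports that were closed as "by design".
# These are placeholders, not real issue numbers.
BY_DESIGN_REPORTS: List[str] = ["EXAMPLE-101", "EXAMPLE-102", "EXAMPLE-103"]

# Hypothetical mapping from a test name to the report whose behaviour it verifies.
TESTS_BY_REPORT: Dict[str, str] = {
    "test_example_101_behaviour": "EXAMPLE-101",
    "test_example_103_behaviour": "EXAMPLE-103",
}


def uncovered_reports(reports: List[str], tests: Dict[str, str]) -> List[str]:
    """Return the "by design" reports that no test claims to verify yet."""
    covered = set(tests.values())
    return [report for report in reports if report not in covered]


missing = uncovered_reports(BY_DESIGN_REPORTS, TESTS_BY_REPORT)
print("by-design reports without a test:", missing)
```

A list like `missing` would be one way to see how far the "by design" decisions are from being formal, testable statements.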
Re: Defining Hadoop Compatibility -revisiting-Andrew Purtell 2011-05-11, 07:43
> From: Steve Loughran <[EMAIL PROTECTED]>
> Subject: Defining Hadoop Compatibility -revisiting-
> To: [EMAIL PROTECTED]
> Date: Tuesday, May 10, 2011, 3:29 AM
>
> Back in Jan 2011, I started a discussion about how to define Apache Hadoop Compatibility:
> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E
>
> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
[...]
> -I don't think you can claim to have a Distribution/Fork/Version of Apache Hadoop if you swap out big chunks of it for alternate filesystems, MR engines, etc. Some description of this is needed
> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"

This is also the case with Brisk, which replaces HDFS and the standard JobTracker with Cassandra and a new JobTracker, and claims to be a Hadoop distribution.

"Apache Hadoop TM Powered by Cassandra"
http://www.datastax.com/products/brisk

"DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. [...]"

- Andy
Re: Defining Hadoop Compatibility -revisiting-Steve Loughran 2011-05-11, 10:34
On 11/05/2011 08:43, Andrew Purtell wrote:
>> From: Steve Loughran<[EMAIL PROTECTED]>
>> -I don't think you can claim to have a Distribution/Fork/Version of Apache Hadoop if you swap out big chunks of it for alternate filesystems, MR engines, etc. Some description of this is needed
>> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>
> This is also the case with Brisk, which replaces HDFS and the standard JobTracker with Cassandra and a new JobTracker, and claims to be a Hadoop distribution.
>
> "Apache Hadoop TM Powered by Cassandra"
> http://www.datastax.com/products/brisk
>
> "DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. [...]"
>

+1. It is something containing Hadoop interfaces and possibly source/artifacts, but I'm not sure how to describe it. It is just something that claims compatibility with Hadoop's filesystem and MR runtime. If Google chose to add the same interfaces to their platform within Google App Engine, it wouldn't be a Hadoop distro either.

I think it's important to set some definitions here *now* so that confusion doesn't set in.
Re: Defining Hadoop Compatibility -revisiting-Eric Baldeschwieler 2011-05-11, 21:24
This is a really interesting topic! I completely agree that we need to get ahead of this.

I would be really interested in learning of any experience other Apache projects, such as Apache or Tomcat, have with these issues.

---
E14 - typing on glass

On May 10, 2011, at 6:31 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
> [...]
Re: Defining Hadoop Compatibility -revisiting-Milind Bhandarkar 2011-05-11, 21:46
I think it's time to separate out functional tests as a "Hadoop Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]

On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote:
>This is a really interesting topic! I completely agree that we need to get ahead of this.
>
>I would be really interested in learning of any experience other apache projects, such as apache or tomcat have with these issues.
>
> [...]
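To make the "passes 100% of the HCK testsuite" idea from this message concrete, here is a minimal sketch of what such a kit could look like in miniature. Everything in it is an assumption made for illustration: `InMemoryFS`, `LossyFS`, the two checks and the `run_kit` / `is_certified` helpers are toy stand-ins, not Hadoop classes and not a proposed HCK design.

```python
from typing import Callable, Dict, List, Tuple


class InMemoryFS:
    """Toy stand-in for a filesystem implementation under test (hypothetical)."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]

    def exists(self, path: str) -> bool:
        return path in self._files

    def delete(self, path: str) -> bool:
        # True only if something was actually removed.
        return self._files.pop(path, None) is not None


class LossyFS(InMemoryFS):
    """Deliberately deviant implementation: silently keeps only 4 bytes."""

    def write(self, path: str, data: bytes) -> None:
        super().write(path, data[:4])


def check_round_trip(fs: InMemoryFS) -> bool:
    fs.write("/a", b"hello world")
    return fs.read("/a") == b"hello world"


def check_delete_removes(fs: InMemoryFS) -> bool:
    fs.write("/b", b"x")
    return fs.delete("/b") and not fs.exists("/b")


# The "kit": an ordered list of functional checks.
KIT: List[Callable[[InMemoryFS], bool]] = [check_round_trip, check_delete_removes]


def run_kit(fs_factory: Callable[[], InMemoryFS],
            kit: List[Callable[[InMemoryFS], bool]]) -> Tuple[int, int]:
    """Run every check against a fresh instance; return (passed, total)."""
    passed = 0
    for check in kit:
        try:
            if check(fs_factory()):
                passed += 1
        except Exception:
            pass  # an exception counts as a failed check
    return passed, len(kit)


def is_certified(fs_factory: Callable[[], InMemoryFS],
                 kit: List[Callable[[InMemoryFS], bool]]) -> bool:
    """'Certified' here means exactly: every check in a non-empty kit passed."""
    passed, total = run_kit(fs_factory, kit)
    return total > 0 and passed == total


print("InMemoryFS:", run_kit(InMemoryFS, KIT), is_certified(InMemoryFS, KIT))
print("LossyFS:   ", run_kit(LossyFS, KIT), is_certified(LossyFS, KIT))
```

The all-or-nothing rule in `is_certified` is the point of the sketch: an implementation that passes most checks still does not get to use the word.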
Re: Defining Hadoop Compatibility -revisiting-M. C. Srivas 2011-05-12, 02:26
While the HCK is a great idea to check quickly if an implementation is "compliant", we still need a written specification to define what is meant by compliance, something akin to a set of RFCs, or a set of docs like the IEEE POSIX specifications.

For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input params, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and what are vendor-specific non-portable extensions that one can use at one's own risk.

Currently we have 2 sets of API in the DFS and Map/Reduce layers, and the specification is extracted only by looking at the code, or (where the code is non-trivial) by writing really bizarre test programs to examine corner cases. Further, the interaction between a mix of the old and new APIs is not specified anywhere. Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API, and pass that stream into a library that manipulates the stream using some of the old API ... what is then the expectation of the state of the stream when the library call returns?

Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail such things down. There's similar good effort in the Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We should continue it.

Doing such specs would be a great service to the community and the users of Hadoop. It provides them
 (a) clear-cut docs on how to use the Hadoop APIs
 (b) wider choice of Hadoop implementations by freeing them from vendor lock-in.

Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).

On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> I think it's time to separate out functional tests as a "Hadoop Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
> [...]
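The stream-state question raised in this message (what is the state of a stream after a library call returns?) can be illustrated with a small sketch. It uses Python's standard `io.BytesIO` rather than any Hadoop stream class, and the two `library_peek_header*` functions are hypothetical library calls invented for the example. Both behaviours are plausible; the point is that without a written spec the caller cannot know which one to expect.

```python
import io
from typing import BinaryIO


def library_peek_header(stream: BinaryIO, n: int) -> bytes:
    """Hypothetical library call: reads n bytes and leaves the position advanced."""
    return stream.read(n)


def library_peek_header_restoring(stream: BinaryIO, n: int) -> bytes:
    """Hypothetical library call: reads n bytes, then restores the caller's position."""
    start = stream.tell()
    try:
        return stream.read(n)
    finally:
        stream.seek(start)


# The application opens the stream and hands it to the library.
plain = io.BytesIO(b"HEADERpayload")
header_plain = library_peek_header(plain, 6)
pos_after_plain = plain.tell()

restoring = io.BytesIO(b"HEADERpayload")
header_restoring = library_peek_header_restoring(restoring, 6)
pos_after_restoring = restoring.tell()

# Same returned bytes, different stream state seen by the caller afterwards.
print(header_plain, pos_after_plain)
print(header_restoring, pos_after_restoring)
```

A specification in the POSIX style would have to state which of the two positions the caller is entitled to rely on, and a compatibility test could then assert it.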
Re: Defining Hadoop Compatibility -revisiting-Ted Dunning 2011-05-12, 04:37
As a specific example of how these are important, over in Mahout-land we have been wrestling with determining just what it means to have dependencies in the lib directory inside a jar. This isn't documented, behaves differently in different versions of Hadoop and means that some Mahout programs work sometimes, but fail in high-profile locations like Twitter.

On Wed, May 11, 2011 at 7:26 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
> Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API, and pass that stream into a library that manipulates the stream using some of the old API ... what is then the expectation of the state of the stream when the library call returns?
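For readers unfamiliar with the layout being discussed, the sketch below builds a job jar in memory with dependency jars under `lib/` and lists them. It shows only the archive layout. It does not show, and makes no claim about, how any particular Hadoop version treats those entries, which is exactly the undocumented part described above. The entry names and both helper functions are invented for the example.

```python
import io
import zipfile
from typing import List


def build_job_jar() -> bytes:
    """Build a tiny jar (a zip archive) with two dependency jars under lib/."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("com/example/Job.class", b"")  # placeholder class file
        jar.writestr("lib/dep-a.jar", b"")          # placeholder dependency
        jar.writestr("lib/dep-b.jar", b"")          # placeholder dependency
    return buf.getvalue()


def bundled_dependencies(jar_bytes: bytes) -> List[str]:
    """List the jar entries that look like bundled dependencies in lib/."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return sorted(
            name for name in jar.namelist()
            if name.startswith("lib/") and name.endswith(".jar")
        )


deps = bundled_dependencies(build_job_jar())
print(deps)
```

A written specification would need to say what a conforming runtime does with each entry in `deps`, for example whether they end up on the task classpath and in what order.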
Re: Defining Hadoop Compatibility -revisiting-Steve Loughran 2011-05-12, 09:32
On 12/05/2011 03:26, M. C. Srivas wrote:
> While the HCK is a great idea to check quickly if an implementation is "compliant", we still need a written specification to define what is meant by compliance, something akin to a set of RFCs, or a set of docs like the IEEE POSIX specifications.
>
> For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input params, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and what are vendor-specific non-portable extensions that one can use at one's own risk.

I have been known to be critical of standards bodies in the past
http://www.waterfall2006.com/loughran.html

And I've been in them. It is absolutely essential that the Hadoop stack doesn't become controlled by a standards body, as then you become controlled by whoever can afford to send the most people to the standards events -and make behind the scenes deals with others to get votes through.

> Currently we have 2 sets of API in the DFS and Map/Reduce layers, and the specification is extracted only by looking at the code, or (where the code is non-trivial) by writing really bizarre test programs to examine corner cases. Further, the interaction between a mix of the old and new APIs is not specified anywhere. Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API, and pass that stream into a library that manipulates the stream using some of the old API ... what is then the expectation of the state of the stream when the library call returns?
>
> Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail such things down. There's similar good effort in the Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We should continue it.
>
> Doing such specs would be a great service to the community and the users of Hadoop. It provides them
> (a) clear-cut docs on how to use the Hadoop APIs

+1

> (b) wider choice of Hadoop implementations by freeing them from vendor lock-in.

=0

They won't be hadoop implementations, they will be "something that is compatible with the Apache Hadoop API as defined in v 0.x of the Hadoop compatibility kit". Furthermore, there's the issue of any google patents -while google have given Hadoop permission to them, that may not apply to other things that implement compatible APIs.

I also think that the Hadoop team need to be the ones who own the interfaces and tests, define the tests as a functional test suite for testing Hadoop distributions, and reserve the right to make changes to the interfaces, semantics and tests as suits the team's needs. The input from others -especially related community projects- is important, but, to be ruthless, the compatibility issues with things that aren't really Apache Hadoop are less important. If you choose to reimplement Hadoop, you take on the costs of staying current.

> Once we have such specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).
Re: Defining Hadoop Compatibility -revisiting-Segel, Mike 2011-05-12, 09:49
While IANAL...

As long as any implementation follows Apache's license regarding derivative works, it's fair game. (This is my understanding; YMMV.) The Apache License is very liberal in what one can do with a derivative work...

Surely Apache has some lawyers who can summarize what is allowable when talking about a derivative work and what is not?

Note these are my opinions only and do not reflect the opinions of anyone else. Any resemblance to a coherent thought is pure coincidence.....

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 12, 2011, at 4:33 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:

(b) wider choice of Hadoop implementations by freeing them from vendor lock-in.

=0

They won't be hadoop implementations, they will be "something that is compatible with the Apache Hadoop API as defined in v 0.x of the Hadoop compatibility kit". Furthermore, there's the issue of any google patents -while google have given Hadoop permission to them, that may not apply to other things that implement compatible APIs.
Re: Defining Hadoop Compatibility -revisiting-Eric Baldeschwieler 2011-05-13, 05:05
label:
  print "+1";
  goto label;

I could not agree more with everything you said, Steve!

The Apache Hadoop project should own the definition of Apache Hadoop. Hadoop is far from done. The interfaces need to keep evolving to get to a place where we can be proud of them.

I support "vendors" building replacement components for Apache Hadoop components. That will benefit the community, give folks choices and challenge us to make Apache Hadoop even better.

I think it is critical that Apache Hadoop remain a living / evolving work that is driven by those who are willing to contribute their work to it and that the result of that evolution is the reference implementation that vendors must match & exceed to play.

I'd love to see more effort to add specifications and compatibility tests to Apache Hadoop. We'll continue to invest in specs and see what we can do about tests. I encourage folks who wish to demonstrate compatibility and use the Apache Hadoop trademark with their products to help contribute such work to Apache Hadoop. We should include these things with the code under SVN with our normal patch peer review.

On May 12, 2011, at 2:32 AM, Steve Loughran wrote:
> [...]
Re: Defining Hadoop Compatibility -revisiting-Milind Bhandarkar 2011-05-12, 16:45
HCK and written specifications are not mutually exclusive. However, given the evolving nature of Hadoop APIs, functional tests need to evolve as well, and having them tied to a "current stable" version is easier to do than it is to tie the written specifications.

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167

On 5/11/11 7:26 PM, "M. C. Srivas" <[EMAIL PROTECTED]> wrote:
> [...]
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-12, 22:30
On Thu, May 12, 2011 at 09:45, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> HCK and written specifications are not mutually exclusive. However, given
> the evolving nature of Hadoop APIs, functional tests need to evolve as

I would actually expand it to 'functional and system tests' because the latter are capable of validating inter-component interactions not coverable by functional tests.

Cos

> well, and having them tied to a "current stable" version is easier to do
> than it is to tie the written specifications.
>
> - milind
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 03:40
Cos,

Can you give me an example of a "system test" that is not a functional test? My assumption was that the functionality being tested is specific to a component, and that inter-component interactions (that's what you meant, right?) would be taken care of by the public interface and semantics of a component API.

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-13, 06:24
On Thu, May 12, 2011 at 20:40, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> Cos,
>
> Can you give me an example of a "system test" that is not a functional
> test ? My assumption was that the functionality being tested is specific
> to a component, and that inter-component interactions (that's what you
> meant, right?) would be taken care by the public interface and semantics
> of a component API.

Milind, kinda... However, to exercise inter-component interactions via component APIs, one needs tests that go beyond the functional or component realm (e.g. system tests). At some point I was part of a team working on an integration validation framework for Hadoop (FIT), which addressed inter-component interaction validation, essentially guaranteeing the components' compatibility. The components were Hadoop, Pig, Oozie, etc., so the framework exercised the whole application stack and covered a lot of use cases.

Having a framework like this and a set of test cases available to the Hadoop community is a great benefit, because one can quickly make sure that a Hadoop stack built from a set of components is working properly. Another use case is to run the same set of tests - versioned separately from the product itself - against the previous and the next release, validating their compatibility at the functional level (sorta what you have mentioned).

By the way, this doesn't depend on whether we choose to work on an HCK or not; however, an HCK might eventually be built on top of such a framework.

Cos

> - milind
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 07:11
Cos,

I remember the issues about the "inter-component interactions" from the time when you were part of the Yahoo Hadoop FIT team (I was on the other side of the same floor, remember? ;-)

Things like "Can Pig take full URIs as input, and so work with viewfs", "Can Local jobtracker still use HDFS as input and output", "Can Oozie use local file system to keep workflows, while the jars were located on hdfs" etc. came up often.

Each of these was a component-interaction issue, and resulted either from making DistributedFileSystem a public class, or from some subtle dependency on the semantics of a particular method in an interface that was not explicit in the syntax.

That's an issue with interface compatibility, and so merely compiling against a particular interface is not a solution. One needs a test suite.

(With annotations in Java, one can impose more semantic restrictions on the interface, which can be checked automatically at runtime. But this is limited to individual methods, or the full class. Code generation using perl or whatever is similar in capability.)

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167
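Milind's parenthetical about runtime-checked annotations could look roughly like the sketch below. Everything in it is an assumption made for illustration: `@NeverNull`, `PathResolver`, `Checked.wrap`, and the `hdfs://nn` prefix are hypothetical and not part of any Hadoop API. It uses a standard `java.lang.reflect.Proxy` to enforce a per-method rule that the Java signature alone cannot express.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

/** Hypothetical annotation: the annotated method must never return null. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface NeverNull {}

/** Hypothetical component interface carrying one semantic restriction. */
interface PathResolver {
    @NeverNull
    String resolve(String path);
}

class Checked {
    /** Wraps target in a proxy that enforces @NeverNull on calls made through iface. */
    static <T> T wrap(Class<T> iface, T target) {
        Object proxy = Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] { iface },
            (self, method, callArgs) -> {
                Object result;
                try {
                    result = method.invoke(target, callArgs);
                } catch (InvocationTargetException e) {
                    throw e.getCause();   // rethrow what the implementation threw
                }
                if (result == null && method.isAnnotationPresent(NeverNull.class)) {
                    throw new IllegalStateException(method.getName() + " returned null");
                }
                return result;
            });
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        PathResolver good = wrap(PathResolver.class, p -> "hdfs://nn" + p);
        System.out.println(good.resolve("/data"));

        PathResolver bad = wrap(PathResolver.class, p -> null);
        try {
            bad.resolve("/data");
        } catch (IllegalStateException e) {
            System.out.println("contract violation: " + e.getMessage());
        }
    }
}
```

As the message says, a check like this sees one method call at a time, so it cannot express cross-component expectations; those still need a test suite.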
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-13, 17:47
On Fri, May 13, 2011 at 00:11, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> Cos,
>
> I remember the issues about the "inter-component interactions" at that
> point when you were part of the Yahoo Hadoop FIT team (I was on the other
> side of the same floor, remember ? ;-)

Vaguely ;) Of course I remember. But I prefer not to mention any internal technologies developed for private companies, after getting lashes for that.

> Things like, "Can Pig take full URIs as input, and so works with viewfs",
> "Can Local jobtracker still use HDFS as input and output", "Can Oozie use
> local file system to keep workflows, while the jars were located on hdfs"
> etc came up often.
>
> Each of these issues were component-interaction issues, and were results
> of making DistributedFileSystem a public class, or some subtle dependency
> on the semantics of a particular method in an interface, which were not
> explicit in the syntax.
>
> That's an issue with interface-compatibility, and so merely compiling
> against a particular interface is not a solution. One needs a test-suite.

One needs more than a mere test suite, if experience teaches us anything. FIT and its continuation turned out to be a complex program (not only in the sense of computer code) with many moving parts, bells and whistles. One of those was a set of specs actually written in English. The downside is that someone needs to keep them up to date, translate them into test cases or teach others how to do it, etc. That's exactly why the TCK used a test generator and a somewhat formalized spec language.

Cos

> (With annotations in Java, one can impose more semantic restrictions on
> the interface, that can be automatically checked against at runtime. But
> is limited to individual methods, or the full class. Code generation using
> perl or whatever is similar in capability.)
>
> - milind
Re: Defining Hadoop Compatibility -revisiting-
Ian Holsman 2011-05-11, 22:42
For apache (httpd, I'm assuming you mean), we define compatibility as adherence to the set of RFCs that define the HTTP protocol.

I'm no expert in this (Roy is though), but we could attempt to do something similar when it comes to HDFS/Map-Reduce protocols. I'm not sure what benefit there would be to going to an RFC, as opposed to documenting the API on our site.

On May 12, 2011, at 7:24 AM, Eric Baldeschwieler wrote:

> This is a really interesting topic! I completely agree that we need to get ahead of this.
>
> I would be really interested in learning of any experience other apache projects, such as apache or tomcat have with these issues.
>
> ---
> E14 - typing on glass
>
> On May 10, 2011, at 6:31 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
>
>> Back in Jan 2011, I started a discussion about how to define Apache
>> Hadoop Compatibility:
>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E
>>
>> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>
>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>
>> It claims that their implementations are 100% compatible, even though
>> the Enterprise edition uses a C filesystem. It also claims that both
>> their software releases contain "Certified Stacks", without defining
>> what Certified means, or who does the certification -only that it is an
>> improvement.
>>
>> I think we should revisit this issue before people with their own
>> agendas define what compatibility with Apache Hadoop is for us
>>
>> Licensing
>> -Use of the Hadoop codebase must follow the Apache License
>> http://www.apache.org/licenses/LICENSE-2.0
>> -plug in components that are dynamically linked to (Filesystems and
>> schedulers) don't appear to be derivative works on my reading of this,
>>
>> Naming
>> -this is something for branding@apache, they will have their opinions.
>> The key one is that the name "Apache Hadoop" must get used, and it's
>> important to make clear it is a derivative work.
>> -I don't think you can claim to have a Distribution/Fork/Version of
>> Apache Hadoop if you swap out big chunks of it for alternate
>> filesystems, MR engines, etc. Some description of this is needed
>> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>
>> Compatibility
>> -the definition of the Hadoop interfaces and classes is the Apache
>> Source tree,
>> -the definition of semantics of the Hadoop interfaces and classes is
>> the Apache Source tree, including the test classes.
>> -the verification that the actual semantics of an Apache Hadoop
>> release is compatible with the expected semantics is that current and
>> future tests pass
>> -bug reports can highlight incompatibility with expectations of
>> community users, and once incorporated into tests form part of the
>> compatibility testing
>> -vendors can claim and even certify their derivative works as
>> compatible with other versions of their derivative works, but cannot
>> claim compatibility with Apache Hadoop unless their code passes the
>> tests and is consistent with the bug reports marked as ("by design").
>> Perhaps we should have tests that verify each of these "by design"
>> bugreps to make them more formal.
>>
>> Certification
>> -I have no idea what this means in EMC's case, they just say "Certified"
>> -As we don't do any certification ourselves, it would seem impossible
>> for us to certify that any derivative work is compatible.
>> -It may be best to state that nobody can certify their derivative as
>> "compatible with Apache Hadoop" unless it passes all current test suites
>> -And require that anyone who declares compatibility define what they
>> mean by this
>>
>> This is a good argument for getting more functional tests out there
>> -whoever has more functional tests needs to get them into a test module
>> that can be used to test real deployments.
Re: Defining Hadoop Compatibility -revisiting-
Jacob R Rideout 2011-05-11, 22:56
What about defining compatibility as fully implementing all the
public-stable annotated interfaces for a particular release?

Jacob Rideout
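Jacob's proposal amounts to a signature-level check, which could be sketched as below. The `@PublicStable` marker, `BlockStore`, `DebugHooks`, and `VendorStore` are hypothetical stand-ins invented for this illustration; the sketch only assumes that the stable API is marked with some runtime-visible annotation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical marker standing in for a "public and stable" API annotation. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PublicStable {}

/** Hypothetical stable interface that a compatible implementation must provide. */
@PublicStable
interface BlockStore {
    byte[] read(String path);
}

/** Not annotated, so not part of the compatibility surface. */
interface DebugHooks {
    void dump();
}

class VendorStore implements BlockStore {
    public byte[] read(String path) { return new byte[0]; }
}

class SignatureCheck {
    /** Returns the names of annotated interfaces that the candidate does not implement. */
    static List<String> missing(Class<?> candidate, List<Class<?>> api) {
        List<String> gaps = new ArrayList<>();
        for (Class<?> iface : api) {
            boolean required = iface.isInterface()
                && iface.isAnnotationPresent(PublicStable.class);
            if (required && !iface.isAssignableFrom(candidate)) {
                gaps.add(iface.getSimpleName());
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Class<?>> api = Arrays.asList(BlockStore.class, DebugHooks.class);
        System.out.println("VendorStore gaps: " + missing(VendorStore.class, api));
        System.out.println("String gaps:      " + missing(String.class, api));
    }
}
```

Note that `VendorStore` passes this check while returning an empty array for every path, so a check of this kind says nothing about behaviour.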
Re: Defining Hadoop Compatibility -revisiting-
Aaron Kimball 2011-05-11, 23:20
What does it mean to "implement" those interfaces? I'm +1 for a TCK-based
definition. In addition to statically implementing a set of interfaces, each interface also implicitly includes a set of acceptable inputs and predicted outputs (or ranges of outputs) for those inputs.

- Aaron
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-12, 09:33

On 12/05/2011 00:20, Aaron Kimball wrote:
> What does it mean to "implement" those interfaces? I'm +1 for a TCK-based
> definition. In addition to statically implementing a set of interfaces, each
> interface also implicitly includes a set of acceptable inputs and predicted
> outputs (or ranges of outputs) for those inputs.

+1: Parnas's definition of "interface" is signature+semantics. Java interface files just define the signature, not behaviour. A test kit is needed.
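The signature-plus-semantics point can be illustrated with a minimal sketch. Everything in it is hypothetical: `InMemoryFS`, `OverwritingFS` and `check_contract` are invented for illustration and are not Hadoop APIs, and the "create must not overwrite" rule is an assumed expectation, not a statement about HDFS behaviour. Both classes expose the same method signatures; only the contract check tells them apart.

```python
class InMemoryFS:
    """Hypothetical reference implementation: create() refuses to overwrite."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class OverwritingFS(InMemoryFS):
    """Same signatures, different behaviour: create() silently overwrites."""

    def create(self, path, data):
        self._files[path] = data


def check_contract(fs_factory):
    """Return the names of violated expectations (empty list means none)."""
    failures = []
    fs = fs_factory()
    fs.create("/a", b"one")
    if fs.read("/a") != b"one":
        failures.append("read-after-create")
    try:
        fs.create("/a", b"two")
        # Reaching this line means the second create() did not raise.
        failures.append("create-must-not-overwrite")
    except FileExistsError:
        pass
    return failures
```

Calling `check_contract(InMemoryFS)` and `check_contract(OverwritingFS)` gives different results even though a purely static check of the method signatures would accept both, which is the gap a test kit is meant to close.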
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-12, 22:26

TCK (or JCK initially) was done as a tool to basically compare Java
Lang specs with a particular implementation including but not limited
to an extensive suite of say compiler tests.

So I assume before we can embark on any sort of HCK suite some formal
specs would have to be defined. It's rather hard to say that
implementation X is(not) compatible with Apache Hadoop for the lack of
API and spec level definition of what really comprise such an animal.

As was mentioned someplace else in the thread there's certain effort
happening to document DFS, MR, and Avro APIs. Seems like a very good
start for Hadoop specs at large.
--
  Take care,
Konstantin (Cos) Boudnik

On Wed, May 11, 2011 at 16:20, Aaron Kimball <[EMAIL PROTECTED]> wrote:
> What does it mean to "implement" those interfaces? I'm +1 for a TCK-based
> definition. In addition to statically implementing a set of interfaces, each
> interface also implicitly includes a set of acceptable inputs and predicted
> outputs (or ranges of outputs) for those inputs.
>
> - Aaron
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 03:37

The problem with (only) specs is that they are written in natural
language, and subject to human interpretation, and since humans are bad at
natural language interpretation, this gives rise to something called
standards bodies and lawyers, and that has never been good for anyone in
the past ;-)

Now consider this scenario:

    $ bin/hadoop jar hck-0.20.2.jar --config <myconfig/dir>
    ... Bunch of output ...
    Result: Tests run: 1000, Successful: 999 Failed: 1

This is much easier to interpret, even for humans.

The intention of formally defining compatibility is so that the programs
written for Apache Hadoop run unmodified for other open-source /
closed-source systems that claim to be "Apache Hadoop Compatible". Unless
it can be verified easily, the compatibility definition has no meaning.
So, standards that are only documented are useless.

By the way, one should also define "Apache Hadoop Source Compatible", and
"Apache Hadoop Binary Compatible", depending on whether one recompiles
src/hck/**.java and rebuilds hck.jar or not.

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167

On 5/12/11 3:26 PM, "Konstantin Boudnik" <[EMAIL PROTECTED]> wrote:

>TCK (or JCK initially) was done as a tool to basically compare Java
>Lang specs with a particular implementation including but not limited
>to an extensive suite of say compiler tests.
>
>So I assume before we can embark on any sort of HCK suite some formal
>specs would have to be defined. It's rather hard to say that
>implementation X is(not) compatible with Apache Hadoop for the lack of
>API and spec level definition of what really comprise such an animal.
>
>As was mentioned someplace else in the thread there's certain effort
>happening to document DFS, MR, and Avro APIs. Seems like a very good
>start for Hadoop specs at large.
>--
>  Take care,
>Konstantin (Cos) Boudnik
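The "Tests run / Successful / Failed" scenario in Milind's message can be sketched as a tiny harness. This is only an illustration under stated assumptions: no `hck` jar exists in this thread beyond the imagined command line, and `run_suite`, `summary` and the two sample checks are hypothetical names, not part of any Hadoop test kit.

```python
def run_suite(checks):
    """Run each (name, callable) pair; a check passes if it does not raise."""
    failed = []
    for name, check in checks:
        try:
            check()
        except Exception:
            failed.append(name)
    total = len(checks)
    return total, total - len(failed), failed


def summary(total, ok, failed):
    """Format the result line in the style quoted in the message."""
    return f"Tests run: {total}, Successful: {ok} Failed: {len(failed)}"


def check_passes():
    # Stand-in for a real compatibility check that succeeds.
    assert sorted([3, 1, 2]) == [1, 2, 3]


def check_fails():
    # Stand-in for a check that an implementation does not satisfy.
    raise AssertionError("behaviour differs from the reference")
```

A vendor or an independent observer would call `run_suite` with the full list of checks against the system under test and publish the `summary` line together with the names in `failed`, so the claim can be verified by anyone who reruns the suite.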
Re: Defining Hadoop Compatibility -revisiting-
Ted Dunning 2011-05-13, 04:05

Did anybody propose natural language only specifications?

On Thu, May 12, 2011 at 8:37 PM, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> The problem with (only) specs is that they are written in natural
> language, and subject to human interpretation,
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 04:52

Ok, my mistake. They have only asked for documented specifications. I may
have been influenced by all the specifications I have read. All of them
were in English, which is characterized as a natural language.

But then, if you are proposing a specification in a non-natural-language,
isn't that called a test suite? Or is there a middle ground?

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167

On 5/12/11 9:05 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

>Did anybody propose natural language only specifications?
Re: Defining Hadoop Compatibility -revisiting-
Ted Dunning 2011-05-13, 05:38

I would say that an English spec with associated test suite is a middle
ground.

On Thu, May 12, 2011 at 9:52 PM, Milind Bhandarkar <[EMAIL PROTECTED]> wrote:
> Ok, my mistake. They have only asked for documented specifications. I may
> have been influenced by all the specifications I have read. All of them
> were in English, which is characterized as a natural language.
>
> But then, if you are proposing a specification in a non-natural-language,
> isn't that called a test suite ? Or is there a middle ground ?
>
> - milind
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-13, 06:12

The way it has been done in JCK was specs written in a somewhat
formalized language and a tool (called testgen, written in Perl if I
remember correctly) which was dynamically generating a lot of lang
tests. I think this is the middle ground Milind has mentioned.

BTW, it was a _huge_ effort: Sun had two teams working on TCK - ~40+
people - working for a few years on that thing.
--
  Take care,
Konstantin (Cos) Boudnik

On Thu, May 12, 2011 at 22:38, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I would say that an English spec with associated test suite is a middle
> ground.
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 06:57

Sure. As I said before, they are not mutually exclusive. Just stating my
experience that specs without a test suite are of no use. If I were to
prioritize, I would give priority to a TCK over natural-language specs.
That's all.

So far, I have seen many replacements for HDFS as InputFormat and
OutputFormat that read from or write to different data sources and sinks
only. It is easily imaginable to have pluggable app managers and a
pluggable resource manager after MR-279 (other than local, which is part
of Apache Hadoop, but not "compatible", think distributed cache). So, we
would need a spec and a test suite per component (i.e. app manager,
resource manager, current scheduler, replication target chooser,
authentication, authorization) now.

If the binary protocols were to be crystallized, I can imagine others
implementing only the datanode, or a task tracker. So we would need a
protocol-level compatibility suite for individual daemons as well.

I agree with one of the statements that Steve L made, that "Hadoop has an
enviable problem of too much activity." If one follows the activities in
commercial world, open source, academic and industry-sponsored R&D, one
quickly realizes that writing RFCs for all the above components and fixing
them without versioning is cumbersome and difficult optimistically, and
near impossible realistically.

Also, my experience is that keeping standards documentation for an
evolving technology up-to-date with the proper implementation is a
pipe-dream at best. A test suite that gets compiled and run every time a
new version comes out is within the realm of possibility.

Therefore, all I am saying is that, while a POSIX-like spec is a "nice to
have", a test-suite that defines compatibility is a must.

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167

On 5/12/11 10:38 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

>I would say that an English spec with associated test suite is a middle
>ground.
Re: Defining Hadoop Compatibility -revisiting-
Eric Baldeschwieler 2011-05-16, 05:34

Good point.

Tests are a must for the Hadoop community to meet its own goals (quality and backwards compatibility).

Writing detailed specs for something that is evolving this quickly is challenging. Also in a lot of cases, documenting the current APIs to POSIX-like detail will mainly convince us that we need to design new APIs that are more like POSIX.

The good news is that this is open source and we can crowd source both activities, although it will still require a lot of work from our committers to validate and integrate this sort of contribution.

E14

On May 12, 2011, at 11:57 PM, Milind Bhandarkar wrote:
...
> Therefore, all I am saying is that, while a POSIX-like spec is a "nice to
> have", a test-suite that defines compatibility is a must.
>
> - milind
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-16, 10:50

On 13/05/11 05:52, Milind Bhandarkar wrote:
> Ok, my mistake. They have only asked for documented specifications. I may
> have been influenced by all the specifications I have read. All of them
> were in English, which is characterized as a natural language.
>
> But then, if you are proposing a specification in a non-natural-language,
> isn't that called a test suite ? Or is there a middle ground ?

There are formal specifications in languages like Z. We don't really want to go there if we can help it, as all it lets you do is prove correctness if you're a mathematician, and I haven't found the mathematician plugin for Jenkins yet.

There are also languages like Extended ML, from Sannella et al, who may be familiar to Doug from his time in the frozen lands of the north (Edinburgh):
http://homepages.inf.ed.ac.uk/dts/eml/

Some of the bits of spec in this language can be executed, as long as you don't start declaring things about state over time. Again, though, it's hard work, unless your target language is, say, ML or Haskell, as there you can jump from Specification to Implementation fairly rapidly.

Where the formal stuff is good is for things like consistency protocols, so I'd hope someone did get out the proofs for Zookeeper, so the rest of us can rely on it working.

-Steve
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-12, 09:23

On 11/05/2011 22:24, Eric Baldeschwieler wrote:
> This is a really interesting topic! I completely agree that we need to get ahead of this.
>
> I would be really interested in learning of any experience other apache projects, such as apache or tomcat have with these issues.

I don't know about apache httpd.

Tomcat is the JCP reference implementation of JSP, the JSP Jar is broadly reused, and the JCP program defines a test kit (with licensing T&Cs) to define compatibility. That is because the JCP program was designed to split specification from implementation. Hadoop doesn't have that, which is a strength and a weakness. Strength: agility. Weakness: compatibility between versions as well as with others.

I think Sun NFS might be a good example of a similar de facto standard, or MS SMB -it is up to others to show they are compatible with what is effectively the reference implementation. Being closed source, there is no option for anyone to include SunOS NFS or MS SMB in their products -the issue of "how much of SunOS NFS to include before you have to stop calling it that" never arose.
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-12, 16:45

On May 12, 2011, at 2:23 AM, Steve Loughran wrote:
> I think Sun NFS might be a good example of a similar de facto standard, or MS SMB -it is up to others to show they are compatible with what is effectively the reference implementation. Being closed source, there is no option for anyone to include SunOS NFS or MS SMB in their products -the issue of "how much of SunOS NFS to include before you have to stop calling it that" never arose.

SMB and FAT are better examples given how much of NFS and associated protocols are actually defined in RFCs (1094 being the first one). :)

FAT compatibility between devices is notoriously bad if you go beyond simple files.
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 06:16

Certification seems like mission creep. Our mission is to produce
open-source software. If we wish to produce testing software, that
seems fine. But running a certification program for non-open-source
software seems like a different task.

The Hadoop mark should only be used to refer to open-source software
produced by the ASF. If other folks wish to make factual statements
concerning our software, e.g., that their proprietary software passes
tests that we've created, that may be fine, but I don't think we should
validate those claims by granting certifications to institutions. That
ventures outside the mission of the ASF. We are not an accrediting
organization.

Doug

On 05/10/2011 12:29 PM, Steve Loughran wrote:
>
> Back in Jan 2011, I started a discussion about how to define Apache
> Hadoop Compatibility:
> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E
>
> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>
> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>
> It claims that their implementations are 100% compatible, even though
> the Enterprise edition uses a C filesystem. It also claims that both
> their software releases contain "Certified Stacks", without defining
> what Certified means, or who does the certification -only that it is an
> improvement.
>
> I think we should revisit this issue before people with their own
> agendas define what compatibility with Apache Hadoop is for us
>
> Licensing
> -Use of the Hadoop codebase must follow the Apache License
> http://www.apache.org/licenses/LICENSE-2.0
> -plug in components that are dynamically linked to (Filesystems and
> schedulers) don't appear to be derivative works on my reading of this,
>
> Naming
> -this is something for branding@apache, they will have their opinions.
> The key one is that the name "Apache Hadoop" must get used, and it's
> important to make clear it is a derivative work.
> -I don't think you can claim to have a Distribution/Fork/Version of
> Apache Hadoop if you swap out big chunks of it for alternate
> filesystems, MR engines, etc. Some description of this is needed
> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>
> Compatibility
> -the definition of the Hadoop interfaces and classes is the Apache
> Source tree,
> -the definition of semantics of the Hadoop interfaces and classes is
> the Apache Source tree, including the test classes.
> -the verification that the actual semantics of an Apache Hadoop release
> is compatible with the expected semantics is that current and future
> tests pass
> -bug reports can highlight incompatibility with expectations of
> community users, and once incorporated into tests form part of the
> compatibility testing
> -vendors can claim and even certify their derivative works as
> compatible with other versions of their derivative works, but cannot
> claim compatibility with Apache Hadoop unless their code passes the
> tests and is consistent with the bug reports marked as ("by design").
> Perhaps we should have tests that verify each of these "by design"
> bugreps to make them more formal.
>
> Certification
> -I have no idea what this means in EMC's case, they just say "Certified"
> -As we don't do any certification ourselves, it would seem impossible
> for us to certify that any derivative work is compatible.
> -It may be best to state that nobody can certify their derivative as
> "compatible with Apache Hadoop" unless it passes all current test suites
> -And require that anyone who declares compatibility define what they
> mean by this
>
> This is a good argument for getting more functional tests out there
> -whoever has more functional tests needs to get them into a test module
> that can be used to test real deployments.
Re: Defining Hadoop Compatibility -revisiting-
Milind Bhandarkar 2011-05-13, 07:24

+1.

Apache foundation or contributors to Apache should not waste their energy
providing such certification. Compatibility claims should be easily
verifiable by users of these proprietary systems or independent observers,
if a test-suite were readily available to run.

>The Hadoop mark should only be used to refer to open-source software
>produced by the ASF.

IANAL, but Steve is questioning usage of "Apache Hadoop Compatible" in PR
material of commercial software. Is this considered as usage of "The
Hadoop mark"?

- milind

--
Milind Bhandarkar
[EMAIL PROTECTED]
+1-650-776-3167

On 5/12/11 11:16 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

>Certification seems like mission creep. Our mission is to produce
>open-source software. If we wish to produce testing software, that
>seems fine. But running a certification program for non-open-source
>software seems like a different task.
>
>The Hadoop mark should only be used to refer to open-source software
>produced by the ASF. If other folks wish to make factual statements
>concerning our software, e.g., that their proprietary software passes
>tests that we've created, that may be fine, but I don't think we should
>validate those claims by granting certifications to institutions. That
>ventures outside the mission of the ASF. We are not an accrediting
>organization.
>
>Doug
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 08:53

On 05/13/2011 09:24 AM, Milind Bhandarkar wrote:
> IANAL, but Steve is questioning usage of "Apache Hadoop Compatible" in PR
> material of commercial software. Is this considered as usage of "The
> Hadoop mark" ?

It's a usage. Is it permitted? Let's consider.

"EMC Greenplum HD Enterprise Edition - The Enterprise Edition is a 100
percent interface-compatible implementation of the Apache Hadoop stack."

The trademark question is whether this creates confusion about what
Hadoop means (not acceptable) or whether it's just a statement about
Hadoop (acceptable). On one hand, it might be read to say, "EE is
Hadoop", which creates confusion; on the other, it might be read to say,
"EE's API is a superset of Hadoop's API", a statement of fact that may
or may not be true. I think the intended reading is the latter, but it
should probably be stated more clearly: Hadoop has an API but Hadoop is
not its API. Perhaps we should ask them to clarify this?

"EMC Greenplum HD Community Edition - The Community Edition is a 100
percent open source certified and supported version of the Apache Hadoop
stack comprising HDFS, MapReduce, Zookeeper, Hive and HBase."

Here "certified" is probably just intended to mean that the software
uses a "certified" open source license, e.g., listed at
http://www.opensource.org/licenses/. However they should say that this
"includes" or "contains" the various Apache products, not that it "is"
them.

Doug
Re: Defining Hadoop Compatibility -revisiting-
Ted Dunning 2011-05-13, 13:43

I thought the word "comprising" meant includes, not is.

On Fri, May 13, 2011 at 1:53 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> "EMC Greenplum HD Community Edition - The Community Edition is a 100
> percent open source certified and supported version of the Apache Hadoop
> stack comprising HDFS, MapReduce, Zookeeper, Hive and HBase."
>
> Here "certified" is probably just intended to mean that the software
> uses a "certified" open source license, e.g., listed at
> http://www.opensource.org/licenses/. However they should say that this
> "includes" or "contains" the various Apache products, not that it "is"
> them.
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 14:50

Yes, but there's an "is" earlier in the sentence.

Doug

On May 13, 2011 3:44 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:
> I thought the word "comprising" meant includes, not is.
Re: Defining Hadoop Compatibility -revisiting-
Nathan Roberts 2011-05-13, 15:19

Key seems to be how one would interpret "version". Replace it with a synonym
like "variant" and this may be the intent.

On 5/13/11 9:50 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> Yes, but there's an "is" earlier in the sentence.
>
> Doug
Nathan Roberts 2011-05-13, 15:19
-
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-13, 17:28

On May 13, 2011, at 1:53 AM, Doug Cutting wrote:
> Here "certified" is probably just intended to mean that the software
> uses a "certified" open source license, e.g., listed at
> http://www.opensource.org/licenses/. However they should say that this
> "includes" or "contains" the various Apache products, not that it "is" them.

If it has a modified version of Hadoop (i.e., not an actual Apache release or patches which have never been committed to trunk), are they allowed to say "includes Apache Hadoop"?

At what point is it not Apache Hadoop?
Re: Defining Hadoop Compatibility -revisiting-
Segel, Mike 2011-05-13, 17:32

My first read was that they used the term Apache Hadoop in reference to Apache's release. They referenced their release as Hadoop.

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 12:28 PM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote:
>
> On May 13, 2011, at 1:53 AM, Doug Cutting wrote:
>> Here "certified" is probably just intended to mean that the software
>> uses a "certified" open source license, e.g., listed at
>> http://www.opensource.org/licenses/. However they should say that this
>> "includes" or "contains" the various Apache products, not that it "is" them.
>
> If it has a modified version of Hadoop (i.e., not an actual Apache release or patches which have never been committed to trunk), are they allowed to say "includes Apache Hadoop"? At what point is it not Apache Hadoop?

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 21:55

On 05/13/2011 07:28 PM, Allen Wittenauer wrote:
> If it has a modified version of Hadoop (i.e., not an actual Apache
> release or patches which have never been committed to trunk), are
> they allowed to say "includes Apache Hadoop"?

No. Those are the two cases we permit. We used to say that it was
enough for a patch to be in Jira, but Roy clarified last year that
committed to trunk is a better line, since that means the code has been
reviewed and accepted by the community.

Doug
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-13, 22:13

On May 13, 2011, at 2:55 PM, Doug Cutting wrote:
> On 05/13/2011 07:28 PM, Allen Wittenauer wrote:
>> If it has a modified version of Hadoop (i.e., not an actual Apache
>> release or patches which have never been committed to trunk), are
>> they allowed to say "includes Apache Hadoop"?
>
> No. Those are the two cases we permit. We used to say that it was
> enough for a patch to be in Jira, but Roy clarified last year that
> committed to trunk is a better line, since that means the code has been
> reviewed and accepted by the community.

So what do we do about companies that release a product that says "includes Apache Hadoop" but includes patches that aren't committed to trunk?
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 22:16

On 05/14/2011 12:13 AM, Allen Wittenauer wrote:
> So what do we do about companies that release a product that says "includes Apache Hadoop" but includes patches that aren't committed to trunk?

We yell at them to get those patches into trunk already. This policy
was clarified after that product was shipping.

Doug
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-13, 22:17

On May 13, 2011, at 3:16 PM, Doug Cutting wrote:
> On 05/14/2011 12:13 AM, Allen Wittenauer wrote:
>> So what do we do about companies that release a product that says "includes Apache Hadoop" but includes patches that aren't committed to trunk?
>
> We yell at them to get those patches into trunk already. This policy
> was clarified after that product was shipping.

... and if those patches are rejected by the community?
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-13, 22:22

On 05/14/2011 12:17 AM, Allen Wittenauer wrote:
> ... and if those patches are rejected by the community?

It would be very strange, since they've mostly been released in 203,
although not yet having been committed to trunk.

Doug
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-16, 11:15

On 13/05/11 23:16, Doug Cutting wrote:
> On 05/14/2011 12:13 AM, Allen Wittenauer wrote:
>> So what do we do about companies that release a product that says "includes Apache Hadoop" but includes patches that aren't committed to trunk?
>
> We yell at them to get those patches into trunk already. This policy
> was clarified after that product was shipping.
>
> Doug

I distributed some RPMs with my lifecycle branch in. I can't remember what I called them, but I'd better revisit all my .spec files to make sure the text is valid.

Even with 0.21 JARs, what should I call it?

sf-apache-hadoop-operations.rpm

"This RPM contains the JAR artifacts of Apache Hadoop 0.21 and SmartFrog components to manage hadoop clusters, manipulate the distributed filesystems, and submit MapReduce jobs"

Would that work?
Re: Defining Hadoop Compatibility -revisiting-
Eli Collins 2011-05-13, 22:18

On Fri, May 13, 2011 at 3:13 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>
> On May 13, 2011, at 2:55 PM, Doug Cutting wrote:
>
>> On 05/13/2011 07:28 PM, Allen Wittenauer wrote:
>>> If it has a modified version of Hadoop (i.e., not an actual Apache
>>> release or patches which have never been committed to trunk), are
>>> they allowed to say "includes Apache Hadoop"?
>>
>> No. Those are the two cases we permit. We used to say that it was
>> enough for a patch to be in Jira, but Roy clarified last year that
>> committed to trunk is a better line, since that means the code has been
>> reviewed and accepted by the community.
>
> So what do we do about companies that release a product that says "includes Apache Hadoop" but includes patches that aren't committed to trunk?

It's not just companies by the way, we the Hadoop project just made an
official Apache release that contains patches not yet in trunk...
Re: Defining Hadoop Compatibility -revisiting-
Ted Dunning 2011-05-13, 22:53

But "distribution Z includes X" kind of implies the existence of some Y such
that X != Y, Y != empty-set and X+Y = Z, at least in common usage.

Isn't that the same as a non-trunk change?

So doesn't this mean that your question reduces to the question of what
happens when non-Apache changes are made to an Apache release? And isn't
that the definition of a derived work?

On Fri, May 13, 2011 at 3:13 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>
> On May 13, 2011, at 2:55 PM, Doug Cutting wrote:
>
> > On 05/13/2011 07:28 PM, Allen Wittenauer wrote:
> >> If it has a modified version of Hadoop (i.e., not an actual Apache
> >> release or patches which have never been committed to trunk), are
> >> they allowed to say "includes Apache Hadoop"?
> >
> > No. Those are the two cases we permit. We used to say that it was
> > enough for a patch to be in Jira, but Roy clarified last year that
> > committed to trunk is a better line, since that means the code has been
> > reviewed and accepted by the community.
>
> So what do we do about companies that release a product that says
> "includes Apache Hadoop" but includes patches that aren't committed to
> trunk?
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-13, 22:57

On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
> But "distribution Z includes X" kind of implies the existence of some such
> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
>
> Isn't that the same as a non-trunk change?
>
> So doesn't this mean that your question reduces to the question of what
> happens when non-Apache changes are made to an Apache release? And isn't
> that the definition of a derived work?

Yup. Which is why I doubt *any* commercial entity can claim "includes Apache Hadoop" (including Cloudera).
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-16, 11:01

On 13/05/11 23:57, Allen Wittenauer wrote:
>
> On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
>
>> But "distribution Z includes X" kind of implies the existence of some such
>> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
>>
>> Isn't that the same as a non-trunk change?
>>
>> So doesn't this mean that your question reduces to the question of what
>> happens when non-Apache changes are made to an Apache release? And isn't
>> that the definition of a derived work?
>
> Yup. Which is why I doubt *any* commercial entity can claim "includes Apache Hadoop" (including Cloudera).

but they can claim it is a derivative work, which CDH clearly is.
(Though if we were to come up with a formal declaration of what a
derivative work is, we'd have to handle the fact that it is a superset.
Even worse, you may realise a release is the ordered application of a
sequence of patches, and if the patches are applied in a different order
you may end up with a different body of source code...)

Something that implements the APIs may not be a derivative work,
depending on how much of the original code is in there. You could look
at the base classes and interfaces and produce a clean room
implementation (relying on the notion that interfaces are a list of
facts and not copyrightable in the US), but whoever does that may
encounter the issue that Google's donation of the right to use their MR
patent may not apply to such implementations.
Re: Defining Hadoop Compatibility -revisiting-
Segel, Mike 2011-05-16, 12:00

But Cloudera's release is a bit murky.

The math example is a bit flawed...

X represents the set of stable releases.
Y represents the set of available patches.
C represents the set of Cloudera releases.

So if C contains a release X(n) plus a set of patches that is contained in Y,
then does it not have the right to be considered Apache Hadoop?
It's my understanding that any enhancement to Hadoop is made available to Apache and will eventually make it into a later release...

So while it may not be 'official' release X(z), all of its components are in Apache.
(note: I'm talking about the core components and not Cloudera's additional toolsets that encompass Hadoop.)

Cloudera is clearly a derivative work.
And IMHO is the only one which can say ... 'Includes Apache Hadoop'.

That doesn't mean that others can't, depending on how they implemented their changes.
Based on EMC marketing material, they've done a rip and replace of HDFS.
So it wouldn't be a superset since it doesn't contain a complete subset, but contains code that implements the API... So they can't say 'Includes Apache Hadoop', but they can say it's a derivative work based on Apache Hadoop and then go on to show how and why, in their opinion, their product is better. (that's marketing for you...)

Clearly there are others out there...
Hadoop on Cassandra as an example...

Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the table...

But because Apache's licensing is so open, Apache will have a hard time controlling derivative works...
I believe that Steve is incorrect in his assertion concerning potential loss of any patent protection. Again, Apache's licensing is very open and as long as they follow Apache's Ts and Cs, they are covered.

Note: because I am sending this from my email address at my client, I am obliged to say that this email is my opinion and does not reflect on the opinion of my client...
(you know the rest....)

Sent from a remote device. Please excuse any typos...

Mike Segel
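Mike's set framing (a release X(n) plus patches drawn from Y) can be made concrete with a toy subset check. This is only a sketch of the test his argument implies, not any Apache tooling: the patch identifiers and the `all_patches_upstream` helper are invented for illustration and do not refer to real JIRA issues or real distributions.

```python
from typing import Set


def all_patches_upstream(applied: Set[str], available_upstream: Set[str]) -> bool:
    """Return True if every patch applied on top of a release is in the upstream set Y."""
    return applied <= available_upstream


# Y: hypothetical set of patches available upstream (made-up identifiers).
Y = {"P-101", "P-102", "P-103"}

# Two hypothetical distributions built on the same release X(n).
distro_a = {"P-101", "P-103"}   # only upstream patches
distro_b = {"P-101", "P-999"}   # includes one patch that is not upstream

missing = distro_b - Y          # patches that break the subset condition

print(all_patches_upstream(distro_a, Y))
print(all_patches_upstream(distro_b, Y))
print(missing)
```

The check is purely about set membership. It says nothing about the order in which patches were applied, or about whether a patch that exists somewhere upstream was ever reviewed and committed.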
Re: Defining Hadoop Compatibility -revisiting-
Steve Loughran 2011-05-16, 14:11

On 16/05/11 13:00, Segel, Mike wrote:
> But Cloudera's release is a bit murky.
>
> The math example is a bit flawed...
>
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
>
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> Then does it not have the right to be considered Apache Hadoop?
> It's my understanding that any enhancement to Hadoop is made available to Apache and will eventually make it into a later release...

It certainly contains it.

Now, if you want to make life more complex:

-view the contributions to the code base as a series of patches P1...Pn, each of which changes the code.
-These patches are essentially functions that transform the source S to a new state S'.
-the initial state of the source codebase is S0.

Hypothesis: the order in which the patch functions are applied determines the final state of the source tree.

If patches P1 and P2 were applied in order, you would get a state

S' = P2(P1(S0))

Applying the patches in a different order, you get a new final state.

S'' = P1(P2(S0))

Question for the maths people then is: can you be sure that S' and S'' are the same? It would seem to me that it depends on the nature of the function. It could be that the set of functions that SVN supports guarantees sameness, but given conflict resolution problems I've encountered in the past, I doubt this.

Assuming that my belief holds, that the order in which a series of SVN patches are executed determines the final state of the source tree, then saying the patch sets (the sets of functions applied to the source) of two codebases are equivalent does not mean the final state of the code is the same unless the sequence of application is also the same.

That would then define an apache release as a strictly ordered sequence of patches, or at least a sequence of operations that leads to the same final code state, such as S0.20.3

(oh look, I've just written a formal definition of what a release is, though I've avoided defining what a function is. View them as planar projections in cartesian space or something)

>
> So while it may not be 'official' release X(z), all of its components are in Apache.
> (note: I'm talking about the core components and not Cloudera's additional toolsets that encompass Hadoop.)
>
> Cloudera is clearly a derivative work.
> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.

Once you start thinking about the ordering of the patch functions it gets complicated.

> That doesn't mean that others can't, depending on how they implemented their changes.

yes, though again it depends on the sequence of functions applied to the released sourcecode, such as S0.20.3, to the version they ship.

> So it wouldn't be a superset since it doesn't contain a complete subset, but contains code that implements the API... So they can't say 'Includes Apache Hadoop', but they can say it's a derivative work based on Apache Hadoop and then go on to show how and why, in their opinion their product is better. (that's marketing for you...)

I agree

> Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the table...

Clearly, but there are still some questions we can resolve here

-what do they call their products?
-how can they support assertions that their code is compatible if the series of patches they have applied to the codebase are not externally visible?
-what are the concerns of the community about naming and branching?

>
> But because Apache's licensing is so open, Apache will have a hard time controlling derivative works...

The Apache license permits anyone to fork and take that fork in house or closed source. Most people are considered daft to do this except for quick fixes, because any closed source takes on the task of writing the functions needed to transform it from the released state to one that matches customer needs. (i.e. the working state)

Possibly. I avoid such legal issues.

-steve
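Steve's hypothesis that S' = P2(P1(S0)) and S'' = P1(P2(S0)) can differ is easy to illustrate with a toy model. The sketch below is not how SVN applies patches: it assumes a source tree reduced to a list of lines and two made-up patches written as plain Python functions, chosen only because they visibly fail to commute.

```python
from typing import Callable, List

Source = List[str]                   # a source tree, reduced to a list of lines
Patch = Callable[[Source], Source]   # a patch modelled as a function S -> S'


def p1(src: Source) -> Source:
    """Hypothetical patch P1: rename `timeout` to `timeout_ms` everywhere."""
    return [line.replace("timeout", "timeout_ms") for line in src]


def p2(src: Source) -> Source:
    """Hypothetical patch P2: append a line that logs `timeout`."""
    return src + ["log(timeout)"]


def apply_in_order(patches: List[Patch], s0: Source) -> Source:
    """Apply each patch to the result of the previous one, starting from s0."""
    state = list(s0)
    for patch in patches:
        state = patch(state)
    return state


s0 = ["timeout = 30"]
s_prime = apply_in_order([p1, p2], s0)         # S'  = P2(P1(S0))
s_double_prime = apply_in_order([p2, p1], s0)  # S'' = P1(P2(S0))

print(s_prime)                    # ['timeout_ms = 30', 'log(timeout)']
print(s_double_prime)             # ['timeout_ms = 30', 'log(timeout_ms)']
print(s_prime == s_double_prime)  # False
```

Both orders apply the same set of patches, yet only one of the resulting trees still refers to a variable that exists. In this toy model, at least, comparing patch sets is not enough; the sequence has to match too.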
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-16, 17:19

On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
>
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> Then does it not have the right to be considered Apache Hadoop?
> It's my understanding that any enhancement to Hadoop is made available to Apache and will eventually make it into a later release...

This assumption is probably wrong. It likely wouldn't be hard to find patches made in Cloudera Hadoop that have been rejected from Apache Hadoop. I know some of the code in Cloudera Hadoop 2 was definitely rejected. If Cloudera Hadoop 3's lineage is based upon 2...
Re: Defining Hadoop Compatibility -revisiting-
Eli Collins 2011-05-16, 21:09

On Mon, May 16, 2011 at 10:19 AM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>
> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
>> X represents the set of stable releases.
>> Y represents the set of available patches.
>> C represents the set of Cloudera releases.
>>
>> So if C contains a release X(n) plus a set of patches that is contained in Y,
>> Then does it not have the right to be considered Apache Hadoop?
>> It's my understanding that any enhancement to Hadoop is made available to Apache and will eventually make it into a later release...
>
> This assumption is probably wrong. It likely wouldn't be hard to find patches made in Cloudera Hadoop that have been rejected from Apache Hadoop. I know some of the code in Cloudera Hadoop 2 was definitely rejected. If Cloudera Hadoop 3's lineage is based upon 2...

Allen,

There are few things in Hadoop in CDH that are not in trunk,
branch-20-security, or branch-20-append. The stuff in this category
is not major (eg HADOOP-6605, better JAVA_HOME detection).

One of the things we and others are busy doing is getting the work
from CDH3 and 20x (formerly YDH) checked into trunk so a future
release won't regress against these 20-based releases.

Most projects in CDH are not heavily patched btw, they're close to an
upstream Apache release. Hadoop is the exception.

https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases

Thanks,
Eli
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-16, 21:25

On May 16, 2011, at 2:09 PM, Eli Collins wrote:
>
> Allen,
>
> There are few things in Hadoop in CDH that are not in trunk,
> branch-20-security, or branch-20-append. The stuff in this category
> is not major (eg HADOOP-6605, better JAVA_HOME detection).

But that's my point: when is it no longer Apache Hadoop? How major does a change need to be under the line? In the case of CDH2 and 3, in order to test it out, I actually had to back out some of Cloudera's "improvements" in order to even test whereas I didn't under Apache.

Is this another place where we only seem to care about APIs and say to hell with the rest of the stack?
Re: Defining Hadoop Compatibility -revisiting-
Eli Collins 2011-05-16, 21:29

On Mon, May 16, 2011 at 2:25 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>
> On May 16, 2011, at 2:09 PM, Eli Collins wrote:
>>
>> Allen,
>>
>> There are few things in Hadoop in CDH that are not in trunk,
>> branch-20-security, or branch-20-append. The stuff in this category
>> is not major (eg HADOOP-6605, better JAVA_HOME detection).
>
> But that's my point: when is it no longer Apache Hadoop? How major does a change need to be under the line? In the case of CDH2 and 3, in order to test it out, I actually had to back out some of Cloudera's "improvements" in order to even test whereas I didn't under Apache. Is this another place where we only seem to care about APIs and say to hell with the rest of the stack?

I don't think anyone is saying to hell with the rest of the stack, and
everyone I've spoken to is on-board with a future release that
doesn't require lots of backporting from feature branches.

Thanks,
Eli
Re: Defining Hadoop Compatibility -revisiting-
Allen Wittenauer 2011-05-16, 21:42

On May 16, 2011, at 2:29 PM, Eli Collins wrote:
> On Mon, May 16, 2011 at 2:25 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>>
>> On May 16, 2011, at 2:09 PM, Eli Collins wrote:
>>>
>>> Allen,
>>>
>>> There are few things in Hadoop in CDH that are not in trunk,
>>> branch-20-security, or branch-20-append. The stuff in this category
>>> is not major (eg HADOOP-6605, better JAVA_HOME detection).
>>
>> But that's my point: when is it no longer Apache Hadoop? How major does a change need to be under the line? In the case of CDH2 and 3, in order to test it out, I actually had to back out some of Cloudera's "improvements" in order to even test whereas I didn't under Apache. Is this another place where we only seem to care about APIs and say to hell with the rest of the stack?
>
> I don't think anyone is saying to hell with the rest of the stack, and
> everyone I've spoken to is on-board with a future release that
> doesn't require lots of backporting from feature branches.

You've missed my point.

Does "Hadoop compatibility" and the ability to say "includes Apache Hadoop" only apply when we're talking about MR and HDFS APIs?
Re: Defining Hadoop Compatibility -revisiting-
Ian Holsman 2011-05-16, 21:59

> Does "Hadoop compatibility" and the ability to say "includes Apache Hadoop" only apply when we're talking about MR and HDFS APIs?

It is confusing, isn't it.

We could go down the route Java did and say that the APIs are 'hadoop' and ours is just a reference implementation of it. (but as others pointed out, we don't want to become a certification group)

Out of curiosity, how good is our test suite at exercising our APIs?
Is it sophisticated enough to catch someone adding a functionality-changing patch (e.g. the append one) and flag it as a test failure?
Re: Defining Hadoop Compatibility -revisiting-
Konstantin Boudnik 2011-05-17, 01:52

We have the following method coverage:

Common ~60%
HDFS ~80%
MR ~70%

(better analysis will be available after our projects are connected to Sonar, I think).

While method coverage isn't a completely adequate answer to your question, I'd say there is a possibility to sneak in some semantic and even API changes which might go entirely unvalidated by the test suites. It isn't very high, but it does exist.

A better approach to validating semantics is to run cluster tests (e.g. system tests), which have better potential to exercise public APIs than functional tests.

There's HADOOP-7278 to address this for 0.22 (and potentially others)

--
Take care,
Konstantin (Cos) Boudnik

Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any company the author might be affiliated with at the moment of writing.

On Mon, May 16, 2011 at 14:59, Ian Holsman <[EMAIL PROTECTED]> wrote:
>
>> Does "Hadoop compatibility" and the ability to say "includes Apache Hadoop" only apply when we're talking about MR and HDFS APIs?
>
> It is confusing isn't it.
>
> We could go down the route java did and say that the API's are 'hadoop' and ours is just a reference implementation of it. (but others pointed out, we don't want to become a certification group)
>
> Out of curiosity, how good is our test suite in exercising our APIs?
> Is it sophisticated enough to capture someone adding a functionality-changing patch (eg the append one). and have it flag it as a test-failure?
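Konstantin's caveat, that method coverage can be high while a semantic change still slips through, can be shown with a small sketch. Nothing here is Hadoop code: `ToyFile`, `PatchedToyFile` and the two test functions are hypothetical stand-ins, and the "patch" is an invented change to what `append` does.

```python
class ToyFile:
    """Hypothetical stand-in for a file-like API; not a Hadoop class."""

    def __init__(self) -> None:
        self.data = b""

    def append(self, chunk: bytes) -> None:
        # Original semantics: add the chunk to the end of the existing data.
        self.data += chunk


class PatchedToyFile(ToyFile):
    """Same API after an invented functionality-changing patch."""

    def append(self, chunk: bytes) -> None:
        # Changed semantics: the chunk replaces the existing data.
        self.data = chunk


def smoke_test(file_cls) -> bool:
    """Calls append(), so the method counts as covered, but checks no content."""
    f = file_cls()
    f.append(b"a")
    f.append(b"b")
    return True


def semantic_test(file_cls) -> bool:
    """Checks what append() is actually supposed to do."""
    f = file_cls()
    f.append(b"a")
    f.append(b"b")
    return f.data == b"ab"


for cls in (ToyFile, PatchedToyFile):
    print(cls.__name__, smoke_test(cls), semantic_test(cls))
```

Both classes pass the smoke test, so a method-coverage figure would look identical before and after the patch. Only the test that asserts on the resulting content tells the two apart.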
Re: Defining Hadoop Compatibility -revisiting-
Matthew Foley 2011-05-16, 21:17

It's important to distinguish between the name "Hadoop", which is protected by trademark law,
and the Hadoop implementation, which is licensed as opensource under copyright law.

The term "derivative work" is, I believe, only relevant under copyright law, not trademark law.
(N.B., I'm not a lawyer -- and this email is my opinion, not my employer's.) Since the Apache License
explicitly allows derivative works, I don't think it's a useful term for this discussion.

However, the ASF, and by delegation the Hadoop PMC, has a lot of control over the name,
and how we allow it to be used, under trademark law. But to keep our rights under that
law, we have to enforce the trademark consistently. So it's good that we're having this discussion,
and it's important to reach a conclusion, document it, and enforce it consistently.

There are a lot of subtleties; for instance, if I recall correctly from my days with Adobe and
PostScript(R), someone who has not licensed a trademark "X" can still claim "compatible with X"
as long as they ALSO make clear that the product is NOT, itself, an "X". But you really need
a lawyer to get into that stuff.

--Matt
Re: Defining Hadoop Compatibility -revisiting-Segel, Mike 2011-05-17, 00:40
I just checked... TESS said no trademarks for Hadoop.
So... what TM protection? :-)

You are correct about derivative works. It's a moot point as long as the derivative work follows the T&Cs...

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 16, 2011, at 4:18 PM, "Matthew Foley" <[EMAIL PROTECTED]> wrote:

> It's important to distinguish between the name "Hadoop", which is protected by trademark law,
> and the Hadoop implementation, which is licensed as opensource under copyright law.
>
> The term "derivative work" is, I believe, only relevant under copyright law, not trademark law.
> (N.B., I'm not a lawyer -- and this email is my opinion, not my employer's.) Since the Apache License
> explicitly allows derivative works, I don't think it's a useful term for this discussion.
>
> However, the ASF, and by delegation the Hadoop PMC, has a lot of control over the name,
> and how we allow it to be used, under trademark law. But to keep our rights under that
> law, we have to enforce the trademark consistently. So it's good that we're having this discussion,
> and it's important to reach a conclusion, document it, and enforce it consistently.
>
> There are a lot of subtleties; for instance, if I recall correctly from my days with Adobe and
> PostScript(R), someone who has not licensed a trademark "X" can still claim "compatible with X"
> as long as they ALSO make clear that the product is NOT, itself, an "X". But you really need
> a lawyer to get into that stuff.
>
> --Matt
>
> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
>
> But Cloudera's release is a bit murky.
>
> The math example is a bit flawed...
>
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
>
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> Then does it not have the right to be considered Apache Hadoop?
> It's my understanding that any enhancement to Hadoop is made available to Apache and will eventually make it into a later release...
>
> So while it may not be 'official' release X(z), all of its components are in Apache.
> (note: I'm talking about the core components and not Cloudera's additional toolsets that encompass Hadoop.)
>
> Cloudera is clearly a derivative work.
> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.
>
> That doesn't mean that others can't, depending on how they implemented their changes.
> Based on EMC marketing material, they've done a rip and replace of HDFS.
> So it wouldn't be a superset since it doesn't contain a complete subset, but contains code that implements the API... So they can't say 'Includes Apache Hadoop', but they can say it's a derivative work based on Apache Hadoop and then go on to show how and why, in their opinion their product is better. (that's marketing for you...)
>
> Clearly there are others out there...
> Hadoop on Cassandra as an example...
>
> Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the table...
>
> But because Apache's licensing is so open, Apache will have a hard time controlling derivative works...
> I believe that Steve is incorrect in his assertion concerning potential loss of any patent protection. Again Apache's licensing is very open and as long as they follow Apache's Ts and Cs, they are covered.
>
> Note: because I am sending this from my email address at my client, I am obliged to say that this email is my opinion and does not reflect on the opinion of my client...
> (you know the rest....)
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 16, 2011, at 6:02 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:
>
> On 13/05/11 23:57, Allen Wittenauer wrote:
>
> On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
>
> But "distribution Z includes X" kind of implies the existence of some such
> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
Re: Defining Hadoop Compatibility -revisiting-
Scott Carey 2011-05-17, 01:12
On trademarks, what about the phrase: "New distribution for Apache Hadoop"? I've seen that used, and it's something that replaces most of the stack. I believe "Apache Hadoop" is trademarked in this context, even if Hadoop alone isn't.

"Compatible with Apache Hadoop" is a smaller issue: defining some rough guidelines for various forms of compatibility is useful for the community (and reputable vendors), and abuse of that will at least become obvious. But "distribution for Apache Hadoop" (not too sure what 'for' means here)? Is there any TM protection? A proprietary derivative work with most of the guts replaced is not an Apache Hadoop distribution, nor a distribution for Apache Hadoop.
Re: Defining Hadoop Compatibility -revisiting-
Segel, Mike 2011-05-17, 01:50
Let me clarify...
I searched on Hadoop as a term in any TM. Nothing came back...

This means that Apache Hadoop didn't show up.

Note the following: I did the basic search. I wouldn't be surprised if someone from Apache comes back and says see TM xxxxxxxx ...

-Mike

Sent from a remote device. Please excuse any typos...

Mike Segel
Re: Defining Hadoop Compatibility -revisiting-
Eric Baldeschwieler 2011-05-17, 02:32
My understanding is that a history of defending your trademark is more important than registration. Apache does defend Hadoop.

---
E14 - typing on glass
Re: Defining Hadoop Compatibility -revisiting-
Andrew Purtell 2011-05-17, 02:52
> On trademarks, what about the phrase: "New distribution for Apache
> Hadoop"? I've seen that used, and it's something that
> replaces most of the stack. [...] A proprietary derivative work with
> most of the guts replaced is not an Apache Hadoop distribution, nor
> a distribution for Apache Hadoop.

IMHO, this is the key issue. Allowing proprietary derivative works that provide Hadoop-compatible APIs to claim they are Hadoop will provoke endless confusion, argument, claim, and counter-claim, and poison the well for all involved with Apache Hadoop.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Defining Hadoop Compatibility -revisiting-
Matthew Foley 2011-05-17, 09:19
TESS only has "registered" trademarks -- that's the kind of trademark you put an "(R)" next to.
But you can have an ordinary unregistered trademark -- the kind you put a "tm" next to -- just by claiming it, and then promoting and defending it.

In the second paragraph of our bylaws <http://hadoop.apache.org/bylaws.html> we claim:

    The foundation holds the trademark on the name "Hadoop" and copyright on Apache code including the code in the Hadoop codebase.

In the LICENSE.txt file in our distribution, clause 6 of the Apache License states,

    6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

Very important! This exclusion of trademarks from the opensource license is normal and appropriate, precisely because it is the primary tool for preventing confusion and fragmentation in an opensource marketplace.

However, we also have to promote and defend the trademark. I'll leave it up to the lawyers to define exactly what that means, but it does take some effort. We should probably use the "tm" annotation in our logo GIFs and our primary market-facing documents (not so much in the code). Perhaps the PMC or Apache Board can ask some of our sponsor organizations to help out with some techdocs and trademark legal review assistance. It's not really a topic for us lay people to argue over, it just wastes time.

WRT Scott's comments below, understand that trademark lawyers love to talk about using trademarks only as "modifiers", i.e., essentially as adjectives. We shouldn't say "Chevrolet" as a thing, we should say "Chevrolet (tm) cars". There are many kinds of cars, and "Chevrolet" is a mark distinguishing THIS kind of car from the others. You can't trademark THINGS, you can only trademark DISTINGUISHING MARKS that differentiate things in the marketplace. And it has nothing to do with the underlying technology.

So here are three statements I believe to be true:

1. We should defend against the usage "Apache Hadoop", "Cloudera Hadoop", "Yahoo Hadoop", and "EMC Hadoop". This would imply that Hadoop was a THING, and those other words -- all trademarks! -- are the modifiers. Not acceptable.

2. If the PMC chooses to, it is okay to allow usages like "Cloudera distribution of Hadoop", "Yahoo distribution of Hadoop", and "EMC distribution of Hadoop", or even "powered by Hadoop", "built on Hadoop", "Hadoop inside", or whatever, where "Hadoop" is understood to mean "Hadoop distributed computing platform" (as opposed to other kinds of distributed computing platform product). ALL of those usages are claiming to "be Hadoop" in some sense, so they are protected by the trademark on "Hadoop". Therefore, they can only be used if each of those companies obtains a license from Apache to use the Hadoop trademark, which should include an agreement about correct use of the mark. And currently they DON'T have such a license, because the Apache License specifically excludes trademarks!

3. If other companies choose to sell distributed computing platform products that are NOT named "Hadoop", but their marketing literature says these products are "compatible with Hadoop" while also making clear that they are not Hadoop and don't claim to be Hadoop -- we probably can't do anything about it. Claims of compatibility are generally protected for the sake of competition in the marketplace.

If We-The-Community can come to an agreement about what compatibility means, we can build it into the license to use the "Hadoop" trademark (as in #2 above), then enforce it with peer pressure and market rejection of non-conforming products (#3). But the only real help from the law will be in distinguishing case #2 from case #3.

--Matt

"I am not a lawyer, the above is just my opinion, and does not represent the opinion of my employer."
On May 16, 2011, at 6:12 PM, Scott Carey wrote:

> On trademarks, what about the phrase: "New distribution for Apache Hadoop"? I've seen that used, and it's something that replaces most of the stack. I believe "Apache Hadoop" is trademarked in this context, even if Hadoop alone isn't.
>
> "Compatible with Apache Hadoop" is a smaller issue, defining some rough guidelines for various forms of compatibility is useful for the community (and reputable vendors), abuse of that will at least become obvious. But "distribution for Apache Hadoop" (not too sure what 'for' means here)? Is there any TM protection? A proprietary derivative work with most of the guts replaced is not an Apache Hadoop distribution, nor a distribution for Apache Hadoop.
RE: Defining Hadoop Compatibility -revisiting-
Segel, Mike 2011-05-17, 12:52
Well that would explain it, although Apache itself is a Registered Trade Mark.
I agree you need to step up the enforcement because you don't want Hadoop to become the next Kleenex... :-)
Re: Defining Hadoop Compatibility -revisiting-
Doug Cutting 2011-05-17, 13:24
Matt,
Have you read Apache's trademark policy page?

http://www.apache.org/foundation/marks/

Apache does not generally license its trademarks. Constructions like, "Acme Foo powered by Apache Bar" are generally permitted as they are not deemed to create confusion about the origin of Bar.

Cheers,

Doug
Re: Defining Hadoop Compatibility -revisiting- Matthew Foley 2011-05-17, 17:53
> Constructions like, "Acme Foo powered by Apache Bar" are generally permitted...

Hi Doug,

Great document, very typical of company Trademark Policy documents. It needs to be read in conjunction with the FAQ at http://www.apache.org/foundation/marks/faq/ which is where the "powered by" mark usage is authorized. The "Apache Project Branding Requirements" for PMCs, http://www.apache.org/foundation/marks/pmcs.html goes into more depth, apparently covering all of what I said in my prior email, and more.

The FAQ about the "powered by" usage is not a general permission to use similar constructions; rather it states 8 prescriptive guidelines for exactly how it is okay to use this specific construction. The first one of those requirements is that it can only be used for:

"products or services that are supersets of the functionality of an Apache product, or services [that] are run atop Apache products"

And this statement of permission in the publicly available FAQ constitutes a license, so it is imprecise to say that ASF doesn't license its trademarks. :-)

The Project Branding page authorizes PMCs to develop their own "Powered by <project>" or "<project> Inside" programs. Has the Hadoop PMC done so? Is there a web page for that? Our "PoweredBy" page is only a list of companies and products.

I do need to clarify one thing I said below w.r.t. the "Apache Hadoop" construction. It's perfectly fine to say "Apache Hadoop" as long as there is NOT ALSO a "Yahoo Hadoop", a "Cloudera Hadoop", and an "EMC Hadoop". And indeed such competing usages are forbidden by this very fine Trademark Policy document, while mandating the "Apache Hadoop" usage.

Cheers,

--Matt

On May 17, 2011, at 6:24 AM, Doug Cutting wrote:

Matt,

Have you read Apache's trademark policy page?

http://www.apache.org/foundation/marks/

Apache does not generally license its trademarks. Constructions like, "Acme Foo powered by Apache Bar" are generally permitted as they are not deemed to create confusion about the origin of Bar.

Cheers,

Doug

On 05/17/2011 11:19 AM, Matthew Foley wrote:
> TESS only has "registered" trademarks -- that's the kind of trademark you put an "(R)" next to.
> But you can have an ordinary unregistered trademark -- the kind you put a "tm" next to --
> just by claiming it, and then promoting and defending it.
>
> In the second paragraph of our bylaws<http://hadoop.apache.org/bylaws.html> we claim:
> The foundation holds the trademark on the name "Hadoop" and copyright on
> Apache code including the code in the Hadoop codebase.
> In the LICENSE.txt file in our distribution, clause 6 of the Apache License states,
> 6. Trademarks. This License does not grant permission to use the trade
> names, trademarks, service marks, or product names of the Licensor,
> except as required for reasonable and customary use in describing the
> origin of the Work and reproducing the content of the NOTICE file.
> Very important! This exclusion of trademarks from the opensource license is normal
> and appropriate, precisely because it is the primary tool for preventing confusion and
> fragmentation in an opensource marketplace.
>
> However, we also have to promote and defend the trademark. I'll leave it up to the
> lawyers to define exactly what that means, but it does take some effort. We should
> probably use the "tm" annotation in our logo GIFs and our primary market-facing
> documents (not so much in the code). Perhaps the PMC or Apache Board can ask some
> of our sponsor organizations to help out with some techdocs and trademark legal review
> assistance. It's not really a topic for us lay people to argue over, it just wastes time.
>
> WRT Scott's comments below, understand that trademark lawyers love to talk about using
> trademarks only as "modifiers", i.e., essentially as adjectives. We shouldn't say "Chevrolet"
> as a thing, we should say "Chevrolet (tm) cars". There are many kinds of cars, and "Chevrolet"
Re: Defining Hadoop Compatibility -revisiting- Doug Cutting 2011-05-18, 13:20
On 05/17/2011 07:53 PM, Matthew Foley wrote:
> And this statement of permission in the publicly available FAQ constitutes a license,
> so it is imprecise to say that ASF doesn't license its trademarks. :-)

That's not the way I interpret it. I believe that a license would be required to permit a use that might create confusion while the FAQ provides sample stock phrases that are not thought to create confusion, i.e., nominal uses.

> The Project Branding page authorizes PMCs to develop their own "Powered by <project>"
> or "<project> Inside" programs. Has the Hadoop PMC done so? Is there a web page
> for that? Our "PoweredBy" page is only a list of companies and products.

There is an open issue to create a distinct "Powered by Hadoop" logo:

https://issues.apache.org/jira/browse/HADOOP-7020

Doug
Re: Defining Hadoop Compatibility -revisiting- Roy T. Fielding 2011-05-13, 22:26
On May 13, 2011, at 2:55 PM, Doug Cutting wrote:
> On 05/13/2011 07:28 PM, Allen Wittenauer wrote:
>> If it has a modified version of Hadoop (i.e., not an actual Apache
>> release or patches which have never been committed to trunk), are
>> they allowed to say "includes Apache Hadoop"?
>
> No. Those are the two cases we permit. We used to say that it was
> enough for a patch to be in Jira, but Roy clarified last year that
> committed to trunk is a better line, since that means the code has been
> reviewed and accepted by the community.

Committed to a releasable branch, actually. In other words, a branch (including trunk) that the PMC is collaborating on towards release at some point in the future.

....Roy
Re: Defining Hadoop Compatibility -revisiting- Eric Baldeschwieler 2011-05-16, 05:34
Good point.

On May 12, 2011, at 11:16 PM, Doug Cutting wrote:
> Certification seems like mission creep. Our mission is to produce
> open-source software. If we wish to produce testing software, that
> seems fine. But running a certification program for non-open-source
> software seems like a different task.
>
> The Hadoop mark should only be used to refer to open-source software
> produced by the ASF. If other folks wish to make factual statements
> concerning our software, e.g., that their proprietary software passes
> tests that we've created, that may be fine, but I don't think we should
> validate those claims by granting certifications to institutions. That
> ventures outside the mission of the ASF. We are not an accrediting
> organization.
>
> Doug
>
> On 05/10/2011 12:29 PM, Steve Loughran wrote:
>>
>> Back in Jan 2011, I started a discussion about how to define Apache
>> Hadoop Compatibility:
>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E
>>
>> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>
>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>
>> It claims that their implementations are 100% compatible, even though
>> the Enterprise edition uses a C filesystem. It also claims that both
>> their software releases contain "Certified Stacks", without defining
>> what Certified means, or who does the certification -only that it is an
>> improvement.
>>
>> I think we should revisit this issue before people with their own
>> agendas define what compatibility with Apache Hadoop is for us
>>
>> Licensing
>> -Use of the Hadoop codebase must follow the Apache License
>> http://www.apache.org/licenses/LICENSE-2.0
>> -plug in components that are dynamically linked to (Filesystems and
>> schedulers) don't appear to be derivative works on my reading of this,
>>
>> Naming
>> -this is something for branding@apache, they will have their opinions.
>> The key one is that the name "Apache Hadoop" must get used, and it's
>> important to make clear it is a derivative work.
>> -I don't think you can claim to have a Distribution/Fork/Version of
>> Apache Hadoop if you swap out big chunks of it for alternate
>> filesystems, MR engines, etc. Some description of this is needed
>> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>
>> Compatibility
>> -the definition of the Hadoop interfaces and classes is the Apache
>> Source tree,
>> -the definition of semantics of the Hadoop interfaces and classes is
>> the Apache Source tree, including the test classes.
>> -the verification that the actual semantics of an Apache Hadoop release
>> is compatible with the expected semantics is that current and future
>> tests pass
>> -bug reports can highlight incompatibility with expectations of
>> community users, and once incorporated into tests form part of the
>> compatibility testing
>> -vendors can claim and even certify their derivative works as
>> compatible with other versions of their derivative works, but cannot
>> claim compatibility with Apache Hadoop unless their code passes the
>> tests and is consistent with the bug reports marked as ("by design").
>> Perhaps we should have tests that verify each of these "by design"
>> bugreps to make them more formal.
>>
>> Certification
>> -I have no idea what this means in EMC's case, they just say "Certified"
>> -As we don't do any certification ourselves, it would seem impossible
>> for us to certify that any derivative work is compatible.
>> -It may be best to state that nobody can certify their derivative as
>> "compatible with Apache Hadoop" unless it passes all current test suites
>> -And require that anyone who declares compatibility define what they
>> mean by this
>>
>> This is a good argument for getting more functional tests out there
>> -whoever has more functional tests needs to get them into a test module
>> that can be used to test real deployments.
Re: Defining Hadoop Compatibility -revisiting- Steve Loughran 2011-05-16, 11:20
On 13/05/11 07:16, Doug Cutting wrote:
> Certification seems like mission creep. Our mission is to produce
> open-source software. If we wish to produce testing software, that
> seems fine. But running a certification program for non-open-source
> software seems like a different task.
>

+1

That said, some stricter definition of public interfaces may be useful for the related projects, as a consistent open source stack is strongly beneficial.

> The Hadoop mark should only be used to refer to open-source software
> produced by the ASF. If other folks wish to make factual statements
> concerning our software, e.g., that their proprietary software passes
> tests that we've created, that may be fine, but I don't think we should
> validate those claims by granting certifications to institutions. That
> ventures outside the mission of the ASF. We are not an accrediting
> organization.

+1. Apache is not a standards body, except in the form of "de-facto standards defined by working code and their test suite".

What it does have is strict rules about naming. We should formalise them and publish them on the wiki. Then, whenever some product gets press-released (it's like a beta-release, only earlier in the lifecycle), the vendor can be directed to the page and reminded of the T&Cs of the license and any trade marks.

What does this mean for T-Shirts and Stickers, incidentally?
Re: Defining Hadoop Compatibility -revisiting- Sanjay Radia 2011-05-23, 16:27
Agree.

On May 12, 2011, at 11:16 PM, Doug Cutting wrote:
> Certification seems like mission creep. Our mission is to produce
> open-source software. If we wish to produce testing software, that
> seems fine. But running a certification program for non-open-source
> software seems like a different task.
>
> The Hadoop mark should only be used to refer to open-source software
> produced by the ASF. If other folks wish to make factual statements
> concerning our software, e.g., that their proprietary software passes
> tests that we've created, that may be fine, but I don't think we should
> validate those claims by granting certifications to institutions. That
> ventures outside the mission of the ASF. We are not an accrediting
> organization.
>
> Doug
>
> On 05/10/2011 12:29 PM, Steve Loughran wrote:
>>
>> Back in Jan 2011, I started a discussion about how to define Apache
>> Hadoop Compatibility:
>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[EMAIL PROTECTED]%3E
>>
>> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>
>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>
>> It claims that their implementations are 100% compatible, even though
>> the Enterprise edition uses a C filesystem. It also claims that both
>> their software releases contain "Certified Stacks", without defining
>> what Certified means, or who does the certification -only that it is an
>> improvement.
>>
>> I think we should revisit this issue before people with their own
>> agendas define what compatibility with Apache Hadoop is for us
>>
>> Licensing
>> -Use of the Hadoop codebase must follow the Apache License
>> http://www.apache.org/licenses/LICENSE-2.0
>> -plug in components that are dynamically linked to (Filesystems and
>> schedulers) don't appear to be derivative works on my reading of this,
>>
>> Naming
>> -this is something for branding@apache, they will have their opinions.
>> The key one is that the name "Apache Hadoop" must get used, and it's
>> important to make clear it is a derivative work.
>> -I don't think you can claim to have a Distribution/Fork/Version of
>> Apache Hadoop if you swap out big chunks of it for alternate
>> filesystems, MR engines, etc. Some description of this is needed
>> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>
>> Compatibility
>> -the definition of the Hadoop interfaces and classes is the Apache
>> Source tree,
>> -the definition of semantics of the Hadoop interfaces and classes is
>> the Apache Source tree, including the test classes.
>> -the verification that the actual semantics of an Apache Hadoop release
>> is compatible with the expected semantics is that current and future
>> tests pass
>> -bug reports can highlight incompatibility with expectations of
>> community users, and once incorporated into tests form part of the
>> compatibility testing
>> -vendors can claim and even certify their derivative works as
>> compatible with other versions of their derivative works, but cannot
>> claim compatibility with Apache Hadoop unless their code passes the
>> tests and is consistent with the bug reports marked as ("by design").
>> Perhaps we should have tests that verify each of these "by design"
>> bugreps to make them more formal.
>>
>> Certification
>> -I have no idea what this means in EMC's case, they just say "Certified"
>> -As we don't do any certification ourselves, it would seem impossible
>> for us to certify that any derivative work is compatible.
>> -It may be best to state that nobody can certify their derivative as
>> "compatible with Apache Hadoop" unless it passes all current test suites
>> -And require that anyone who declares compatibility define what they
>> mean by this
>>
>> This is a good argument for getting more functional tests out there
>> -whoever has more functional tests needs to get them into a test module
Re: Defining Hadoop Compatibility -revisiting- Steve Loughran 2011-05-24, 16:23
I've drafted a policy on the wiki based on this discussion.

http://wiki.apache.org/hadoop/Defining%20Hadoop

Others need to look at, edit, etc, then we can vote on whether to take it into the managed documentation.
Re: Defining Hadoop Compatibility -revisiting- Owen O'Malley 2011-05-31, 22:08
On May 24, 2011, at 9:23 AM, Steve Loughran wrote:
>
> I've drafted a policy on the wiki based on this discussion.
>
> http://wiki.apache.org/hadoop/Defining%20Hadoop
>
> Others need to look at, edit, etc, then we can vote on whether to take it into the managed documentation.

I think it looks great. Thanks for your work at drafting it, Steve. It addresses the rapidly growing problem of

Do we need the escape clause that states "or other products which have written approval from the VP, Apache Brand Management?" I think that will just cause lots of petitions from the numerous companies with rationalizations about why their exception should be allowed. If we limit Hadoop to mean precisely the Apache releases, there is no ambiguity or room to argue.

-- Owen