[VOTE] Direction for Hadoop development
Owen O'Malley, 2010-11-29, 22:30

All,

Based on the discussion on HADOOP-6685, there is a pretty fundamental difference of opinion about how Hadoop should evolve. We need to figure out how the majority of the PMC wants the project to evolve to understand which patches move us forward. Please vote whether you approve of the following direction. Clearly, as the author, I'm +1.

-- Owen

Hadoop has always included library code so that users had a strong foundation to build their applications on without needing to continually reinvent the wheel. This combination of framework and powerful library code is a common pattern for successful projects, such as Java, Lucene, etc. Toward that end, we need to continue to extend the Hadoop library code and actively maintain it as the framework evolves. Continuing support for SequenceFile and TFile, which are both widely used, is mandatory. The opposite pattern of implementing the framework and letting each distribution add the required libraries will lead to increased community fragmentation and vendor lock-in.

Hadoop's generic serialization framework had a lot of promise when it was introduced, but has been hampered by a lack of plugins other than Writables and Java serialization. Supporting a wide range of serializations natively in Hadoop will give the users new capabilities. Currently, to support Avro or ProtoBuf objects, mutually incompatible third-party solutions are required. It benefits Hadoop to support them with a common framework that will support all of them. In particular, having easy, out-of-the-box support for Thrift, ProtoBufs, Avro, and our legacy serializations is a desired state.

As a distributed system, there are many instances where Hadoop needs to serialize data. Many of those applications need a lightweight, versioned serialization framework like ProtocolBuffers or Thrift, and using them is appropriate. Adding dependences on Thrift and ProtocolBuffers to the previous dependence on Avro is acceptable.
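The plugin model described in the proposal can be pictured with a small sketch. This is an illustration only: the `SerializationPlugin` interface, the two toy plugins, and the registry class are hypothetical names invented here. They are not Hadoop's serialization API and not the code in HADOOP-6685. The idea shown is that the framework asks each registered plugin whether it accepts a class and uses the first one that does, so that several serialization formats can sit behind one lookup.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a serialization plugin registry; not Hadoop's actual API.
class SerializationRegistrySketch {

    /** A plugin declares which classes it handles and how to turn them into bytes. */
    interface SerializationPlugin {
        boolean accepts(Class<?> type);
        byte[] serialize(Object value);
        String name();
    }

    /** Toy stand-in for one format: handles String as UTF-8 bytes. */
    static class StringPlugin implements SerializationPlugin {
        public boolean accepts(Class<?> type) { return type == String.class; }
        public byte[] serialize(Object value) {
            return ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        public String name() { return "string"; }
    }

    /** Toy stand-in for a second format: handles Integer as 4 big-endian bytes. */
    static class IntPlugin implements SerializationPlugin {
        public boolean accepts(Class<?> type) { return type == Integer.class; }
        public byte[] serialize(Object value) {
            return ByteBuffer.allocate(4).putInt((Integer) value).array();
        }
        public String name() { return "int"; }
    }

    private final List<SerializationPlugin> plugins = new ArrayList<>();

    void register(SerializationPlugin plugin) { plugins.add(plugin); }

    /** Returns the first registered plugin that accepts the given type. */
    SerializationPlugin lookup(Class<?> type) {
        for (SerializationPlugin plugin : plugins) {
            if (plugin.accepts(type)) {
                return plugin;
            }
        }
        throw new IllegalArgumentException("No serialization registered for " + type.getName());
    }

    public static void main(String[] args) {
        SerializationRegistrySketch registry = new SerializationRegistrySketch();
        registry.register(new StringPlugin());
        registry.register(new IntPlugin());
        System.out.println(registry.lookup(String.class).name());
        System.out.println(registry.lookup(Integer.class).serialize(42).length);
    }
}
```

In this sketch, adding a format means registering one more plugin; callers of `lookup` do not change.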
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-11-29, 23:14

Owen,

First, I don't see the yes/no issue that you'd like us to vote on here. We vote on patches. We vote on releases. We vote on committers. We don't vote on a project direction statement. Rather folks present plans, others may present their conflicting concerns, and we need to get these to meet in order to make progress on a particular issue.

I too support continuing support for SequenceFile.

I too support adding flexible serialization APIs to MapReduce.

I do not support extending SequenceFile's format in substantial ways. A proliferation of expressively equivalent yet incompatible file formats hinders the interoperable evolution of the Hadoop ecosystem.

I do not support adding new dependencies to the classpath of MapReduce user tasks. We want to provide as much flexibility to user code as possible. The more libraries the system includes the greater the potential for version conflicts. As the Hadoop ecosystem expands, MapReduce should seek primarily to be an efficient, reliable kernel, not an extensive library of tools.

So I agree with some of your points, but not with others.

Cheers,

Doug

On 11/29/2010 02:30 PM, Owen O'Malley wrote:
> All,
> Based on the discussion on HADOOP-6685, there is a pretty fundamental
> difference of opinion about how Hadoop should evolve. We need to figure
> out how the majority of the PMC wants the project to evolve to
> understand which patches move us forward. Please vote whether you
> approve of the following direction. Clearly as the author, I'm +1.
>
> -- Owen
>
> Hadoop has always included library code so that users had a strong
> foundation to build their applications on without needing to continually
> reinvent the wheel. This combination of framework and powerful library
> code is a common pattern for successful projects, such as Java, Lucene,
> etc. Toward that end, we need to continue to extend the Hadoop library
> code and actively maintain it as the framework evolves. Continuing
> support for SequenceFile and TFile, which are both widely used is
> mandatory. The opposite pattern of implementing the framework and
> letting each distribution add the required libraries will lead to
> increased community fragmentation and vendor lock in.
>
> Hadoop's generic serialization framework had a lot of promise when it
> was introduced, but has been hampered by a lack of plugins other than
> Writables and Java serialization. Supporting a wide range of
> serializations natively in Hadoop will give the users new capabilities.
> Currently, to support Avro or ProtoBuf objects mutually incompatible
> third party solutions are required. It benefits Hadoop to support them
> with a common framework that will support all of them. In particular,
> having easy, out of the box support for Thrift, ProtoBufs, Avro, and our
> legacy serializations is a desired state.
>
> As a distributed system, there are many instances where Hadoop needs to
> serialize data. Many of those applications need a lightweight, versioned
> serialization framework like ProtocolBuffers or Thrift and using them is
> appropriate. Adding dependences on Thrift and ProtocolBuffers to the
> previous dependence on Avro is acceptable.
-
Re: [VOTE] Direction for Hadoop development
Owen O'Malley, 2010-11-30, 00:22

On Mon, Nov 29, 2010 at 3:14 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> We don't vote on a project direction statement. Rather folks present plans,
> others may present their conflicting concerns, and we need to get these to
> meet in order to make progress on a particular issue.

We haven't in the past, but clearly there is currently disagreement on the direction for Hadoop that is blocking forward progress. Using a vote to decide on project direction is far better than vetoing patches based on disjoint visions of how to move forward. If the project votes for a direction, a veto that is based on an opposing direction is clearly invalid.

> I too support continuing support for SequenceFile.

I said far more than that. I said that it should be actively maintained as the framework evolves. Clearly the generic serialization changes are far more useful if they include a file format that uses them. Extending SequenceFile and TFile is a good thing.

> I do not support extending SequenceFile's format in substantial ways. A
> proliferation of expressively equivalent yet incompatible file formats
> hinders the interoperable evolution of the Hadoop ecosystem.

There is no equivalent functionality and even if there was, it would still be worthwhile extending SequenceFile since they are so heavily used. Making SequenceFile support the new serialization API increases their value with minimal disruption to users.

> I do not support adding new dependencies to the classpath of MapReduce user
> tasks.

That isn't reasonable. As Hadoop evolves, we have and will continue to add dependences. For example, in your last MapReduce (MAPREDUCE-980) patch you added avro and paranamer as dependences.

-- Owen
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-12-01, 15:21

On 11/29/2010 04:22 PM, Owen O'Malley wrote:
> Using a vote to
> decide on project direction is far better than vetoing patches based on
> disjoint visions of how to move forward. If the project votes for a
> direction, a veto that is based on an opposing direction is clearly invalid.

This is unfortunately not the way Apache projects operate. Vetoes are not overridden by a majority vote.

> For example, in your last MapReduce (MAPREDUCE-980) patch you
> added avro and paranamer as dependences.

If I'm not mistaken, that only adds a dependency to the JobTracker. We don't create different classpaths for daemons than for user code, but we probably should, so that things that only the daemon uses are not also placed on the user's classpath.

Doug
-
Re: [VOTE] Direction for Hadoop development
Owen O'Malley, 2010-12-01, 15:40

On Dec 1, 2010, at 7:21 AM, Doug Cutting wrote:

> On 11/29/2010 04:22 PM, Owen O'Malley wrote:
>> Using a vote to decide on project direction is far better than
>> vetoing patches based on disjoint visions of how to move forward.
>> If the project votes for a direction, a veto that is based on an
>> opposing direction is clearly invalid.
>
> This is unfortunately not the way Apache projects operate. Vetoes
> are not overridden by a majority vote.

This isn't overriding the veto. You've based your veto on changes in project plan that have never been discussed or agreed to. At Ian's suggestion, I wrote my statement to clarify what the project wanted to do. It is completely appropriate for the PMC to vote about what the project should do going forward. I would hope that if the PMC votes for my proposal that you'd withdraw your veto since the basis for your veto would have been rejected.

-- Owen
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-12-06, 19:30

On 12/01/2010 07:40 AM, Owen O'Malley wrote:
>> This is unfortunately not the way Apache projects operate. Vetoes are
>> not overridden by a majority vote.
>
> This isn't overriding the veto. You've based your veto on changes in
> project plan that have never been discussed or agreed to.

Apache projects don't have project plans that are created by majority votes. Have a look at, e.g.:

http://httpd.apache.org/dev/guidelines.html

Majority votes are only used for releases. Plans are announced to get feedback, to aid in consensus building.

Doug
-
Re: [VOTE] Direction for Hadoop development
Chris Douglas, 2010-12-06, 22:40

On Mon, Dec 6, 2010 at 11:30 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> On 12/01/2010 07:40 AM, Owen O'Malley wrote:
>>> This is unfortunately not the way Apache projects operate. Vetoes are
>>> not overridden by a majority vote.
>>
>> This isn't overriding the veto. You've based your veto on changes in
>> project plan that have never been discussed or agreed to.
>
> Apache projects don't have project plans that are created by majority votes.

This reasoning is exactly opposite the facts at issue. The veto of HADOOP-6685 was based, in part, on a component that the patch proposed to modify. Due to an individual's project plan (one that "does not support" extending that component), that change was not allowed to go forward.

Asserting that the PMC does not create project plans by majority vote is correct in some ways; the PMC does not plan out every feature and pre-approve all work contributed in a release. This does not require that its members remain collectively mute on the project's direction.

It is nonsense to assert that every PMC member has the right to block work because it conflicts with their personal vision for Hadoop. That doesn't scale. Instead, it makes authority so ambiguous and diffuse that project direction defaults to the agenda of bullies.

This has been turned on its head. The procedural question is whether individual votes establish project plans, not majority votes. To avoid a pedantic, lawyerly debate on whether this aspect of the veto was valid, its premise was raised as an issue for the community, so it could reach consensus and define the scope of its project. This approach was supposed to incite more constructive debate on an obvious difference in how participants see Hadoop development over the long term.

So do you want to have that discussion, or do you want to bicker about whether you'll be forced to accept the result?

-C

On the vote: I'm +1 on supporting library/platform code in the Hadoop project, particularly in MapReduce. Reducing MR to a distributed sort implementation is not a direction I'm interested in.
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-12-06, 23:45

On 12/06/2010 02:40 PM, Chris Douglas wrote:
> It is nonsense to assert that every PMC member has the right to block
> work because it conflicts with their personal vision [ ... ]

This is the way Apache projects operate. It requires that folks listen to criticism and potentially accept compromises if they wish to make progress. If folks cannot reach consensus in an area then that area will not make progress.

> On the vote: I'm +1 on supporting library/platform code in the Hadoop
> project, particularly in MapReduce. Reducing MR to a distributed sort
> implementation is not a direction I'm interested in.

I am interested in having this project primarily deliver a reliable, efficient MapReduce kernel implementation. That's the core functionality that folks seek to not recreate. The project should focus on a minimal, low-level MapReduce API for this kernel and permit other projects to build higher-level abstractions.

Doug
-
Re: [VOTE] Direction for Hadoop development
Roy T. Fielding, 2010-12-07, 01:09

On Dec 6, 2010, at 3:45 PM, Doug Cutting wrote:
> On 12/06/2010 02:40 PM, Chris Douglas wrote:
>> It is nonsense to assert that every PMC member has the right to block
>> work because it conflicts with their personal vision [ ... ]
>
> This is the way Apache projects operate. It requires that folks listen
> to criticism and potentially accept compromises if they wish to make
> progress. If folks cannot reach consensus in an area then that area
> will not make progress.

Generally speaking, vetoing extension interfaces without a compelling technical reason is not the way Apache operates. We make extensions modular so that diverse collaborators can specialize according to their own needs, not just your needs.

The compelling reason would be a measured performance impact or some other objective degradation of the existing product that can be evaluated by others as a cost/benefit tradeoff and perhaps compensated by modifying the implementation.

>> On the vote: I'm +1 on supporting library/platform code in the Hadoop
>> project, particularly in MapReduce. Reducing MR to a distributed sort
>> implementation is not a direction I'm interested in.
>
> I am interested in having this project primarily deliver a reliable,
> efficient MapReduce kernel implementation. That's the core functionality
> that folks seek to not recreate. The project should focus on a minimal,
> low-level MapReduce API for this kernel and permit other projects to
> build higher-level abstractions.

That is something people can vote on. Changes to the existing products, including plans like the one Owen described, are subject to vote if anyone disagrees with them. They are also subject to veto if and only if they are to be applied to the current release branch (or a released branch).

If a PMC member insists on making design opinion the sole basis of their vetoes, then they are not collaborating with the rest of the PMC. The board will recommend that such a person be removed from the PMC so that the majority can continue to develop the product in peace.

If there is enough interest in a parallel line of development, then the board will recommend splitting the PMC into two or more projects that can compete on the merits of their own designs, with the existing product name remaining with the majority.

Both recommendations are based on our experience with Tomcat (which quickly solved the disagreement on its own, once the choices were laid out, by allowing divergent designs on separate major versions of the same product).

....Roy
-
Re: [VOTE] Direction for Hadoop development
Arun C Murthy, 2010-12-07, 16:45

On Dec 6, 2010, at 5:09 PM, Roy T. Fielding wrote:

> On Dec 6, 2010, at 3:45 PM, Doug Cutting wrote:
>> On 12/06/2010 02:40 PM, Chris Douglas wrote:
>>> It is nonsense to assert that every PMC member has the right to
>>> block work because it conflicts with their personal vision [ ... ]
>>
>> This is the way Apache projects operate. It requires that folks
>> listen to criticism and potentially accept compromises if they wish
>> to make progress. If folks cannot reach consensus in an area then
>> that area will not make progress.
>
> Generally speaking, vetoing extension interfaces without a compelling
> technical reason is not the way Apache operates. We make extensions
> modular so that diverse collaborators can specialize according to
> their own needs, not just your needs.

Thanks for the clarifications Roy.

Doug, do you wish to retract your veto on extensions to SequenceFile? We can continue discussions on the rest of the technical merits of the jira.

thanks,
Arun
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-12-07, 17:18

On 12/06/2010 05:09 PM, Roy T. Fielding wrote:
> Generally speaking, vetoing extension interfaces without a compelling
> technical reason is not the way Apache operates. We make extensions
> modular so that diverse collaborators can specialize according to
> their own needs, not just your needs.

Roy, thanks for your thoughts here.

I have not intended to veto an extension mechanism. In this case, we already have an extension mechanism. The proposal is to change the extension mechanism incompatibly with unclear benefits, add implementations of several extensions to the kernel, and incompatibly change a widely-used file format.

In general I support improving extension mechanisms, but oppose gratuitous changes to file formats and the inclusion of new user-level functionality in the kernel.

I'd like the issue to focus solely on the extension mechanism to clarify the discussion, not on adding extensions to the kernel or file formats.

Tom long ago provided patches showing how the existing configuration system can provide equivalent extension implementations outside of the kernel with no incompatible changes. (MAPREDUCE-376 and MAPREDUCE-377)

> Changes to the existing products,
> including plans like the one Owen described, are subject to vote if anyone
> disagrees with them.

Is this described somewhere? The HTTPD page says, "Long term plans are simply announcements that group members are working on particular issues related to the Apache software. These are not voted on [...]."

> They are also subject to veto if and only if they
> are to be applied to the current release branch (or a released branch).

Owen intends to merge this patch to a release branch.

> The compelling reason would be a measured performance impact or some
> other objective degradation of the existing product that can be
> evaluated by others as a cost/benefit tradeoff and perhaps compensated
> by modifying the implementation.

Files written by the proposed new version would not be readable by older versions of Hadoop. An unaltered application that upgrades to the newer version would begin creating files that could not be interchanged with folks running the older version.

> If a PMC member insists on making design opinion the sole basis of their
> vetoes, then they are not collaborating with the rest of the PMC. The
> board will recommend that such a person be removed from the PMC so that
> the majority can continue to develop the product in peace.

I am not the sole PMC member to express these opinions.

Doug
-
Re: [VOTE] Direction for Hadoop development
Roy T. Fielding, 2010-12-07, 22:37

On Dec 7, 2010, at 9:18 AM, Doug Cutting wrote:
> On 12/06/2010 05:09 PM, Roy T. Fielding wrote:
>> Generally speaking, vetoing extension interfaces without a compelling
>> technical reason is not the way Apache operates. We make extensions
>> modular so that diverse collaborators can specialize according to
>> their own needs, not just your needs.
>
> Roy, thanks for your thoughts here.
>
> I have not intended to veto an extension mechanism. In this case, we
> already have an extension mechanism.

It is my understanding that the existing extension mechanism did not support the desired extensions, so improving it makes sense.

> The proposal is to change the extension mechanism incompatibly with
> unclear benefits,

Good, these are technical reasons. The benefits can be cleared by docs. By incompatible, I assume you mean forward-compatibility of old versions of Hadoop reading newer files. Can we fix that by having the new implementation use the old file format by default until it is configured to use one of the new interfaces for writing?

> add implementations of several extensions to the kernel, and
> incompatibly change a widely-used file format.

You keep referring to the kernel as if it were a product. I don't see a kernel product in the list of things released by Apache Hadoop.

If there were such a product, then it would make sense for Apache Hadoop to also release ancillary products for common libraries, test frameworks, and modular storage interfaces. Rearchitecting the Hadoop product suite into such a logical arrangement would make sense, and after such an architecture is put into place then "keeping the kernel simple" would be a reason to veto a change to the kernel.

> In general I support improving extension mechanisms, but oppose
> gratuitous changes to file formats and the inclusion of new user-level
> functionality in the kernel.

Persistence is not usually considered user-level functionality, nor do the proposed changes seem gratuitous. Owen said the reason was to support type-safety, which may well be a desirable feature for some users. I think it makes sense to find a way to modularize the feature such that this functionality is only brought in when configured by the user.

> I'd like the issue to focus solely on the extension mechanism to
> clarify the discussion, not on adding extensions to the kernel or
> file formats.

That is irrelevant. Doing development via jira discussion is inherently dysfunctional because it promotes such bureaucratic nonsense instead of working towards a common solution via iterative development. My goal here is to fix this goofy behavior, not reinforce it.

> Tom long ago provided patches showing how the existing configuration
> system can provide equivalent extension implementations outside of the
> kernel with no incompatible changes. (MAPREDUCE-376 and MAPREDUCE-377)

They both seem to be active and unfinished. If they are equivalent fixes to the same problem, then I suggest applying them to a branch, documenting how they work, and then agreeing to have a bake-off. A bake-off is a decision made by performance and feature-completeness as an objective way to resolve an impasse due to mutually exclusive vetoes. All sides agree to drop the veto and accept whichever performs best, by majority decision.

>> Changes to the existing products,
>> including plans like the one Owen described, are subject to vote if anyone
>> disagrees with them.
>
> Is this described somewhere? The HTTPD page says, "Long term plans are
> simply announcements that group members are working on particular
> issues related to the Apache software. These are not voted on [...]."

All action items can be voted on. What we are talking about here is a short term plan, and it is listed as a type of action item under changes to products.

>> They are also subject to veto if and only if they
>> are to be applied to the current release branch (or a released branch).
>
> Owen intends to merge this patch to a release branch.

Right. Good reason, so let's fix that. Note, however, that one valid solution is to simply release it as a new major version of the product.

> I am not the sole PMC member to express these opinions.

No, but the other objections seem to be suffering from a lack of independent thought and are predicated on the theory that outside organizations will satisfy the needs of our users instead of an Apache project solving them directly. That is extremely annoying to me, since the only reason I am here is to deal with some folks' failure to think independently of their employer.

....Roy
-
Re: [VOTE] Direction for Hadoop development
Doug Cutting, 2010-12-08, 18:12

On 12/07/2010 02:37 PM, Roy T. Fielding wrote:
> Good, these are technical reasons. The benefits can be cleared by docs.
> By incompatible, I assume you mean forward-compatibility of old versions
> of Hadoop reading newer files. Can we fix that by having the new
> implementation use the old file format by default until it is configured
> to use one of the new interfaces for writing?

+1

> You keep referring to the kernel as if it were a product. I don't see
> a kernel product in the list of things released by Apache Hadoop.

The line is fairly clear. The kernel is the daemons plus the framework code that invokes user code. The set of pluggable user implementations is fairly small: InputFormat, OutputFormat, Mapper, Reducer, RawComparator. SequenceFile was originally part of the kernel but is now only used by user-level InputFormats and OutputFormats.

> If there were such a product, then it would make sense for Apache Hadoop
> to also release ancillary products for common libraries, test frameworks,
> and modular storage interfaces. Rearchitecting the Hadoop product suite
> into such a logical arrangement would make sense, and after such an
> architecture is put into place then "keeping the kernel simple" would
> be a reason to veto a change to the kernel.

Such a re-arrangement has been proposed but not completed. Relevant issues are MAPREDUCE-1638, MAPREDUCE-1453, and MAPREDUCE-1700. It mostly involves build issues; the architecture already largely supports the distinction.

>> Tom long ago provided patches showing how the existing
>> configuration system can provide equivalent extension
>> implementations outside of the kernel with no incompatible changes.
>> (MAPREDUCE-376 and MAPREDUCE-377)
>
> They both seem to be active and unfinished. If they are equivalent fixes
> to the same problem, then I suggest applying them to a branch, documenting
> how they work, and then agreeing to have a bake-off. A bake-off is a
> decision made by performance and feature-completeness as an objective
> way to resolve an impasse due to mutually exclusive vetoes. All sides agree
> to drop the veto and accept whichever performs best, by majority decision.

A bake-off could be a good way to resolve this. Performance differences would not likely be measurable, but folks might examine user programs and consider compatibility and support implications and vote accordingly.

> All action items can be voted on. What we are talking about here is a
> short term plan, and it is listed as a type of action item under
> changes to products.

Then voting on specific short-term actions might be a good way to resolve this. Some specific short-term questions we might vote on:

1. Should we add specific versions of Protocol Buffers and Thrift to the classpath of every MapReduce program?

2. Should SequenceFile be forward-compatible, i.e., if an existing program that stores Writables in a SequenceFile is run against the new version, should the old version still be able to read the output of the new version?

3. Should we continue to support a specified interchange format and/or data model for configuration data, or should configurations rather be opaque binary data? An interchange format might be JSON. An interchange data model might be Map<String,Value> where values can be strings, booleans, numbers, bytes or nested configuration data, defined by a standard API that all configurable items would support. A specified format or model would permit things like using -D to set configuration options and permit generic interaction with external configuration systems. With opaque binary configurations, each configurable item would provide its own API and would require specific new code that calls this API for each parameter that could be set with -D or from an external configuration system.

>>> They are also subject to veto if and only if they
>>> are to be applied to the current release branch (or a released branch).
>>
>> Owen intends to merge this patch to a release branch.

So votes on action items would be simple majority if they're not intended to be merged to a release branch, and vetoable if they are? Is that right?

Doug
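Question 3 above contrasts a shared data model with opaque per-component configuration. A minimal sketch of the first option follows, assuming the simplest possible model (a flat map from string keys to string values). The class name, method name, and property keys are hypothetical and chosen only for illustration; this is not Hadoop's option parser. The point is that one generic routine can apply any -D setting without knowing which component will read it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: generic handling of -D options over a map-based
// configuration model. Not Hadoop's actual command-line parsing code.
class ConfigDefinesSketch {

    /** Collects every -Dkey=value (or "-D key=value") argument into a map. */
    static Map<String, String> parseDefines(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-D")) {
                continue; // not a define; leave it for the application
            }
            // Accept both the joined form and the two-argument form.
            String pair;
            if (arg.length() > 2) {
                pair = arg.substring(2);
            } else if (i + 1 < args.length) {
                pair = args[++i];
            } else {
                pair = "";
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + pair);
            }
            // Split on the first '=' only, so values may themselves contain '='.
            conf.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return conf;
    }

    public static void main(String[] args) {
        String[] example = {"-Dexample.buffer.mb=200", "-D", "example.codec=gzip", "input"};
        System.out.println(parseDefines(example));
    }
}
```

With opaque binary configurations, by contrast, a routine like `parseDefines` would have nothing generic to write into; each configurable item would need its own setter code, which is the cost the question describes.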
-
Re: [VOTE] Direction for Hadoop development
Owen O'Malley, 2010-12-14, 03:08

On Dec 7, 2010, at 2:37 PM, Roy T. Fielding wrote:

>> The proposal is to change the extension mechanism incompatibly with
>> unclear benefits,
>
> Good, these are technical reasons. The benefits can be cleared by docs.
> By incompatible, I assume you mean forward-compatibility of old versions
> of Hadoop reading newer files. Can we fix that by having the new
> implementation use the old file format by default until it is configured
> to use one of the new interfaces for writing?

There are two goals here. The first is to extend the serialization plugin interface. The current patch does things completely compatibly including a shim that will use the previous plugins to satisfy the new API. The benefits are also clear. Avro serialization is possible when it wasn't previously. It also provides a wide range of opportunities that weren't previously possible.

The file format was changed as a demonstration that the serialization interface was useful and complete. The file change is also backwards compatible and will automatically read old versions of the file. Old versions of the code will complain with an error message if they are given a new version. This is exactly the pattern we have used in the past.

So, no, there are no technical issues with the patch as it stands.

> You keep referring to the kernel as if it were a product. I don't see
> a kernel product in the list of things released by Apache Hadoop.

The kernel is a very loosely defined concept. Utilities that are currently used by the framework are "kernel"; others are just used by the users. Some classes are clearly kernel and some are clearly library, but there are some such as BooleanWritable that aren't obvious. It would take a fair amount of work and likely some duplication to segregate out the library code. I also worry that creating such a project would make Hadoop less useful out of the box and decrease the value of the Apache release of Hadoop.

But back to the original point. Doug's (and Tom's) veto was based on:

1. Modification to SequenceFile.
2. It introduces a dependence on Protocol Buffers.

There was strong consensus that SequenceFile was required and should be updated as the framework evolves. The second is not a technical reason. I believe that the entire veto should be considered invalid.

-- Owen
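The compatibility pattern Owen describes (newer readers accept older files, older readers reject newer files with an error) can be sketched as a version check on the file header. The layout below, three magic bytes followed by a one-byte version, and all names in the code are assumptions made for illustration; the sketch does not reproduce the header handling in the HADOOP-6685 patch.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical sketch of a versioned file header check; the layout and
// names are illustrative assumptions, not the actual SequenceFile code.
class VersionedHeaderSketch {
    static final byte[] MAGIC = {'F', 'M', 'T'};

    /**
     * Reads the header and returns the file's version. Any version up to
     * maxSupported is accepted; a newer version is reported as an error.
     */
    static int readVersion(DataInputStream in, int maxSupported) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not a recognized file: bad magic bytes");
        }
        int version = in.readUnsignedByte();
        if (version > maxSupported) {
            throw new IOException("File version " + version
                + " is newer than the highest supported version " + maxSupported);
        }
        return version;
    }

    /** Builds an in-memory header with the given version, for demonstration. */
    static DataInputStream header(int version) {
        byte[] bytes = {'F', 'M', 'T', (byte) version};
        return new DataInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        // A newer reader (supports up to 7) opening an older file (version 6).
        System.out.println(readVersion(header(6), 7));
        // An older reader (supports up to 6) opening a newer file (version 7).
        try {
            readVersion(header(7), 6);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This covers backward compatibility only. The forward-compatibility concern raised earlier in the thread (old readers being unable to use files written by new code) is exactly the second case, which fails by design in this scheme.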
-
Re: [VOTE] Direction for Hadoop developmentEric Sammer 2010-12-14, 04:49
On Mon, Dec 13, 2010 at 10:08 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Dec 7, 2010, at 2:37 PM, Roy T. Fielding wrote:
>
>>> The proposal is to change the extension mechanism incompatibly with unclear benefits,
>>
>> Good, these are technical reasons. The benefits can be cleared by docs. By incompatible, I assume you mean forward-compatibility of old versions of Hadoop reading newer files. Can we fix that by having the new implementation use the old file format by default until it is configured to use one of the new interfaces for writing?
>
> There are two goals here. The first is to extend the serialization plugin interface. The current patch does things completely compatibly including a shim that will use the previous plugins to satisfy the new API. The benefits are also clear. Avro serialization is possible when it wasn't previously. It also provides a wide range of opportunities that weren't previously possible.
>
> The file format was changed as a demonstration that the serialization interface was useful and complete. The file change is also backwards compatible and will automatically read old versions of the file. Old versions of the code will complain with an error message if they are given a new version. This is exactly the pattern we have used in the past.
>
> So, no there are no technical issues with the patch as it stands.

One of the technical issues is the fact that this precludes users from using PB (or thrift or avro) in their jobs if the version required conflicts with what Hadoop proper has on the classpath. We've already seen these kinds of conflicts with other libraries in the wild and I would like to minimize this possibility in the future. Was there something in the patch that addressed this (I may have missed it; only did a cursory scan through)?

Jumping back to the "non-technical" issue, I really think it would help to develop a course of action for resolution similar to what I suggested earlier. It doesn't need to be specifically what I suggested, but I do think that consensus building and conflict resolution are in the best interest of the community. I feel like we could debate what people said, did, meant, or the specifics of this issue for a long time.

Thanks and regards.
--
Eric Sammer
twitter: esammer
data: www.cloudera.com
Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-14, 05:43
On Dec 13, 2010, at 8:49 PM, Eric Sammer wrote:

> One of the technical issues is the fact that this precludes users from using PB (or thrift or avro) in their jobs if the version required conflicts with what Hadoop proper has on the classpath.

This is currently true of all of our libraries and is addressed by MAPREDUCE-1938. After that is committed, users who want to override to a newer version just need to configure their job to do so.

Using a new library, just because it is a serialization library other than Avro, is not an acceptable reason to veto a patch.

-- Owen
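The version conflict raised here comes down to search order: whichever classpath entry provides a class first is the one that gets loaded. A toy model of that ordering follows, assuming the override mentioned above works by placing the user's jars ahead of the framework's. The jar names, version numbers, and the `user_first` flag are hypothetical and are not Hadoop's configuration API.

```python
def resolve(class_name, classpath):
    """Return the first classpath entry that provides class_name, else None."""
    for jar_name, provided_classes in classpath:
        if class_name in provided_classes:
            return jar_name
    return None


def build_classpath(framework_jars, user_jars, user_first=False):
    # Default order puts the framework's jars first, so its version wins.
    # user_first=True is a stand-in for a per-job override (an assumption).
    if user_first:
        return user_jars + framework_jars
    return framework_jars + user_jars


# Hypothetical jars: both provide the same class at different versions.
framework = [("protobuf-2.3.0.jar", {"com.google.protobuf.Message"})]
user = [("protobuf-2.4.0.jar", {"com.google.protobuf.Message"})]

default_choice = resolve(
    "com.google.protobuf.Message", build_classpath(framework, user)
)
override_choice = resolve(
    "com.google.protobuf.Message",
    build_classpath(framework, user, user_first=True),
)
```

With the default order the framework's copy shadows the user's; with the override the user's copy is found first.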
Re: [VOTE] Direction for Hadoop development
Eric Sammer 2010-12-14, 07:14
On Tue, Dec 14, 2010 at 12:43 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Dec 13, 2010, at 8:49 PM, Eric Sammer wrote:
>
>> One of the technical issues is the fact that this precludes users from using PB (or thrift or avro) in their jobs if the version required conflicts with what Hadoop proper has on the classpath.
>
> This is currently true of all of our libraries and is addressed by MAPREDUCE-1938. After that is committed, users who want to override to a newer version just need to configure their job to do so.

That's definitely nice. I hadn't seen that one.

> Using a new library, just because it is a serialization library other than Avro, is not an acceptable reason to veto a patch.

That I'm not really qualified to say. I don't really know the rules on vetoes. But again, I'm more interested in the larger issue you raised (the subject of the thread). Part of direction for Hadoop, to me, is to get to a point where we're spending time working together. Again, I propose:

- Codify (by vote) whether design plans are required or if an informal email indicating intent is sufficient, and under what circumstances. Provide examples to clarify circumstances. Solves the long term but not HADOOP-6685.
- Focus the discussion on evaluation of proposals for remedying the process for conflict resolution. I know some exist, but they're drastic (removal of PMC members, for instance).
- After consensus on above, focus the conversation (in another thread or on JIRA, whatever is most appropriate) on HADOOP-6685 so no one is blocked.
- Put the community of users first in all areas of development and interaction.

To the last point: I understand there's contention from past issues. I genuinely believe everyone has the users' interests at heart. I'm saying this as a user: this kind of contention is not in anyone's interest. We need true resolution to past issues, consensus on what the goals are and generally how to get there including how to resolve further disagreement, and only then can we jump back into the immediate issue where there is disagreement.

I no longer care how corny I sound about this (and it's about to get corny). I implore all parties involved to take a long look at how we interact and to approach this with renewed respect for each other, the project, and the users. Decide to let previous cruft go and start anew. Do that by building consensus on getting out of a veto stalemate and coming up with a long term plan that makes sense to everyone.

To the specific issue:

Owen, would you be amenable to working to find a way to remove the PB dep in support of HADOOP-6685 and handling bootstrapping with either one of the existing deps or simple hard coded length, type, value serialization / deserialization similar to Writables? I understand your points about PB being solid, but Hadoop is already thick with deps (some of which do handle this, even if not in the preferred / most optimal format) and MR-1938 is still a ways off.

Doug, is there any way to get past the objection to the SequenceFile update? It is a widely used format and is currently in Hadoop core. While I agree Hadoop should be a "kernel" as one artifact and libs as another, I think it would be less friction and cleaner to come up with a plan on how to get to that state independent of pending issues right now. It seems like maintaining backwards compat is critical to Owen et al as well and I'm sure we can come up with modifications to the patch to make it forward compat as well (if it's not already; I'm unclear on / don't remember this point).

This, to me, looks like an achievable goal that doesn't compromise the functionality of HADOOP-6685 and leaves the door open to discussion of a stronger kernel / lib separation.

Regards, respect, and no longer afraid of being corny on general@,
--
Eric Sammer
twitter: esammer
data: www.cloudera.com
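For readers unfamiliar with the "hard coded length, type, value" alternative floated in this message, a minimal sketch might look like the following. The record layout (4-byte big-endian length, 1-byte type tag, then the payload), the tag values, and the function names are all assumptions made for illustration; this is not the Writable wire format and not anything proposed in HADOOP-6685.

```python
import struct

# Hypothetical record header: 4-byte payload length, then 1-byte type tag.
HEADER = struct.Struct(">IB")

TYPE_INT = 1  # payload is an 8-byte signed big-endian integer
TYPE_STR = 2  # payload is UTF-8 text


def encode(value):
    """Encode one int or str as a length-type-value record."""
    if isinstance(value, int):
        tag, payload = TYPE_INT, struct.pack(">q", value)
    elif isinstance(value, str):
        tag, payload = TYPE_STR, value.encode("utf-8")
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return HEADER.pack(len(payload), tag) + payload


def decode(data, offset=0):
    """Decode one record at offset; return (value, offset of next record)."""
    length, tag = HEADER.unpack_from(data, offset)
    start = offset + HEADER.size
    payload = data[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated record")
    if tag == TYPE_INT:
        (value,) = struct.unpack(">q", payload)
    elif tag == TYPE_STR:
        value = payload.decode("utf-8")
    else:
        raise ValueError(f"unknown type tag: {tag}")
    return value, start + length


# Two records written back to back, then read in sequence.
stream = encode(42) + encode("hadoop")
first, next_offset = decode(stream)
second, end_offset = decode(stream, next_offset)
```

The trade-off under discussion is visible even at this size: the format has no dependencies, but every new type or version change has to be handled by hand, which is the work a library such as ProtocolBuffers takes over.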
Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-14, 19:08
On Dec 13, 2010, at 11:14 PM, Eric Sammer wrote:

> - Codify (by vote) whether design plans are required or if an informal email indicating intent is sufficient, and under what circumstances. Provide examples to clarify circumstances. Solves the long term but not HADOOP-6685.

We had a presentation about my plans for this jira in June and both Tom and Doug attended and asked questions. It wasn't a lack of communication. At that time, they didn't like the proposal but didn't plan to block it.

In general code changes shouldn't require a vote. The goal is to work together as a community to produce code, not make everything a lawyerly argument.

> Owen, would you be amenable to working to find a way to remove the PB dep in support of HADOOP-6685 and handling bootstrapping with either one of the existing deps or simple hard coded length, type, value serialization / deserialization similar to Writables?

Of course it is possible, but it is a far worse engineering solution. ProtocolBuffers do exactly what I need; it is foolish to implement a hand-crafted replacement.

My point is that a dependence on Avro was accepted without issue. No one had an issue when I added snakeyaml. All of the objections are fundamentally based in a dislike of serializations other than Avro.

-- Owen
Re: [VOTE] Direction for Hadoop development
Jay Booth 2010-12-01, 16:29
>> For example, in your last MapReduce (MAPREDUCE-980) patch you added avro and paranamer as dependences.
>
> If I'm not mistaken, that only adds a dependency to the JobTracker. We don't create different classpaths for daemons than for user code, but we probably should, so that things that only the daemon uses are not also placed on the users classpath.

+1 to separate classpaths for daemons as an eventual goal. As a user, I've definitely lost an afternoon to commons-lang version mismatches. If we can add fewer things to the Task classpath, that's fewer potential future lost afternoons.

I'm not a PMC member, but I suspect I spend more time doing user-level grunt work than many PMC members, so from that perspective:

On internal-to-hadoop serialization: I'm going to spend 99% of my time not caring about these formats and the other 1% of the time needing to know what's going on with them *immediately*. Right now I know nothing about protobuf... learning new things is always great but "while my production job is broken and I'm trying to debug it" isn't really going to be the best time and place for it. JSON on the other hand is human readable and never going to change. I feel a lot safer with JSON than with any binary format, especially considering that we could all be using NewHawtUnforeseenLibrary or IncompatibleWithPreviousReleaseLibrary for our binary serialization in a couple years.

On packaging serialization lib dependencies: Again, additional versioned dependencies on the Task classpath scare me, and that goes double for serialization. I could see a couple ways around it that fall prey to the inner-framework antipattern, and for what it's worth, I'd be willing to accept that additional kludginess if it meant that I wasn't strictly dependent on avro x.x or thrift y.y. What if I'm reading a file that was encoded with an incompatible version? This gets way out of scope from the immediate issue but if I could ship my own serialization library in an assembly jar, and maybe override an additional method or supply a MapOutputEncoder or something, I'd take that tradeoff over being bound to a particular version until the next version of Hadoop comes out. If there were sensible defaults in place, it might not even mean more complexity for the average job.
Re: [VOTE] Direction for Hadoop development
Scott Carey 2010-12-08, 03:33
On Nov 29, 2010, at 4:22 PM, Owen O'Malley wrote:

>> I do not support adding new dependencies to the classpath of MapReduce user tasks.
>
> That isn't reasonable. As Hadoop evolves, we have and will continue to add dependences. For example, in your last MapReduce (MAPREDUCE-980) patch you added avro and paranamer as dependences.

As a non PMC member:

Hadoop has already put enough stuff on the classpath to force me to make a custom build to use (in 0.19 was the start, and now no distribution can work without modification). This is because of it stuffing more and more things on the classpath.

It is completely reasonable to ask that the environment that user code runs in not be polluted with libraries that are not exposed in the Hadoop API, and debate the merits of a patch based on the inclusion of an additional jar on that classpath.

Webapp containers, OSGi, other classloader systems, or dependency rebasing (jarjar links, maven shade, etc) help solve this sort of mess. Even more crudely, the user's lib directory doesn't have to be Hadoop's full lib directory, and the order of inclusion of jars can help.

Either way, if Hadoop wants to be an application execution framework, it can't just throw whatever it wants on the classpath forever. If one wants to provide lots of tools as part of a rich environment for users, the user has to either be able to easily _opt in_ to having those tools available on their class path or _opt out_ of having them there.

Now, this is really a tangent to other issues at hand. I'd like to suggest that rather than point fingers at who added what to what classpath and when, it is just noted that classpath management is a problem that Hadoop needs to solve and not ignore. I'm pretty sure there's a JIRA on it somewhere already. Until it is solved to some degree (since on a scale of 1 to 10 dealing with classpath collisions, Hadoop is currently somewhere between 0 and 1), it's going to limit what can be built without causing user applications to break on an upgrade. Whether those new features are good or bad on their own merits is being conflated with classpath problems that they introduce for users.

> -- Owen
Re: [VOTE] Direction for Hadoop development
Konstantin Shvachko 2010-12-01, 01:57
This sounds like an important issue. But I personally don't understand what exactly the controversy is, and therefore what is this vote about, and what are the choices, if any.

What I understand is that the issue spans over at least two (long) issues and different discussion threads. Could somebody knowledgeable make an independent digest of what is going on and how it stands. I am probably not alone who struggles with this.

Is there a simple answer, like "you vetoed yesterday - now it's my turn" or "Avro should/not hold the monopoly for Hadoop serialization"? These were just humorous examples.

Thanks,
--Konstantin

On Mon, Nov 29, 2010 at 2:30 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> All,
> Based on the discussion on HADOOP-6685, there is a pretty fundamental difference of opinion about how Hadoop should evolve. We need to figure out how the majority of the PMC wants the project to evolve to understand which patches move us forward. Please vote whether you approve of the following direction. Clearly as the author, I'm +1.
>
> -- Owen
>
> Hadoop has always included library code so that users had a strong foundation to build their applications on without needing to continually reinvent the wheel. This combination of framework and powerful library code is a common pattern for successful projects, such as Java, Lucene, etc. Toward that end, we need to continue to extend the Hadoop library code and actively maintain it as the framework evolves. Continuing support for SequenceFile and TFile, which are both widely used is mandatory. The opposite pattern of implementing the framework and letting each distribution add the required libraries will lead to increased community fragmentation and vendor lock in.
>
> Hadoop's generic serialization framework had a lot of promise when it was introduced, but has been hampered by a lack of plugins other than Writables and Java serialization. Supporting a wide range of serializations natively in Hadoop will give the users new capabilities. Currently, to support Avro or ProtoBuf objects mutually incompatible third party solutions are required. It benefits Hadoop to support them with a common framework that will support all of them. In particular, having easy, out of the box support for Thrift, ProtoBufs, Avro, and our legacy serializations is a desired state.
>
> As a distributed system, there are many instances where Hadoop needs to serialize data. Many of those applications need a lightweight, versioned serialization framework like ProtocolBuffers or Thrift and using them is appropriate. Adding dependences on Thrift and ProtocolBuffers to the previous dependence on Avro is acceptable.
Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-01, 19:11
On Nov 30, 2010, at 5:57 PM, Konstantin Shvachko wrote:

> This sounds like an important issue. But I personally don't understand what exactly the controversy is, and therefore what is this vote about, and what are the choices, if any.

The question is how the Hadoop project wants to move forward.

It was motivated by Doug's veto of HADOOP-6685, which was based on his personal decisions about how the project should go forward and not on anything that had been decided by the PMC.

These decisions are much more important to MapReduce, which is a framework, than HDFS which is a client/server model.

1. Should Hadoop include a user-facing library of useful code?

There has been a suggestion that user-facing library code, such as SequenceFile, TFile, DistCp, etc. should be deprecated and that Hadoop should allow third party projects like Avro to supply the user-facing library code that makes Hadoop usable. I think it is critical that we keep those components as part of Hadoop and extend them as the framework evolves. Users depend heavily on SequenceFile for storing their data in Hadoop and they should not be deprecated as Doug has suggested.

2. Should MapReduce support non-Writables through the pipeline out of the box?

There has also been a discussion about whether we should support non-Writables natively. There is already library code in Avro that lets users use Avro types in a custom MapReduce API. A general MapReduce API that encompasses all of the serialization frameworks and does not lock users into a particular one is much more powerful.

Furthermore, making it convenient for the users, by including the plugins in the default configuration and class path, will enable the use of Avro, Thrift and ProtoBuf objects by people who would rather not focus on serialization. Avro and Writables should not be the only first class serializations that Hadoop supports by default.

3. Should a framework dependency on ProtoBuf be allowed?

Doug has added several framework dependences on Avro. The question is whether it is acceptable to use the ProtoBuf library in the framework. Avro is good for uses where there are a lot of objects of the same type. ProtoBuf is better for small number of objects. The question is whether Avro, JSON, and XML should be the only serialization libraries that are acceptable to use in the framework.

-- Owen
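One way to picture the pluggable pipeline in question 2 is a registry that asks each configured plugin whether it accepts a given type and uses the first one that does. The sketch below shows only that general idea. The class and method names are hypothetical stand-ins, not Hadoop's Java serialization API, and JSON and UTF-8 are used purely because they need no third-party library.

```python
import json


class Utf8Serialization:
    """Hypothetical plugin: handles plain strings."""

    def accepts(self, cls):
        return cls is str

    def serialize(self, obj):
        return obj.encode("utf-8")

    def deserialize(self, data):
        return data.decode("utf-8")


class JsonSerialization:
    """Hypothetical plugin: handles dicts and lists."""

    def accepts(self, cls):
        return cls in (dict, list)

    def serialize(self, obj):
        return json.dumps(obj, sort_keys=True).encode("utf-8")

    def deserialize(self, data):
        return json.loads(data.decode("utf-8"))


class SerializationRegistry:
    """Picks the first configured plugin that accepts a type."""

    def __init__(self, plugins):
        self._plugins = list(plugins)

    def for_type(self, cls):
        for plugin in self._plugins:
            if plugin.accepts(cls):
                return plugin
        raise LookupError(f"no serialization registered for {cls.__name__}")


# The plugin list plays the role of the "default configuration".
registry = SerializationRegistry([Utf8Serialization(), JsonSerialization()])
record = {"key": "k1", "count": 3}
plugin = registry.for_type(type(record))
round_tripped = plugin.deserialize(plugin.serialize(record))
```

The pipeline code only ever talks to the registry, so whether a given plugin ships with the framework or comes from a separate library is a packaging decision rather than an API one, which is the point the thread is arguing over.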
Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-06, 17:16
On Dec 1, 2010, at 11:11 AM, Owen O'Malley wrote:

All,

We really need some guidance on the general direction for the project. Please comment and/or vote. If no one cares, then I'll probably commit it to Yahoo's internal branch.

-- Owen

> The question is how the Hadoop project wants to move forward.
>
> It was motivated by Doug's veto of HADOOP-6685, which was based on his personal decisions about how the project should go forward and not on anything that had been decided by the PMC.
>
> These decisions are much more important to MapReduce, which is a framework, than HDFS which is a client/server model.
>
> 1. Should Hadoop include a user-facing library of useful code?
>
> There has been a suggestion that user-facing library code, such as SequenceFile, TFile, DistCp, etc. should be deprecated and that Hadoop should allow third party projects like Avro to supply the user-facing library code that makes Hadoop usable. I think it is critical that we keep those components as part of Hadoop and extend them as the framework evolves. Users depend heavily on SequenceFile for storing their data in Hadoop and they should not be deprecated as Doug has suggested.
>
> 2. Should MapReduce support non-Writables through the pipeline out of the box?
>
> There has also been a discussion about whether we should support non-Writables natively. There is already library code in Avro that lets users use Avro types in a custom MapReduce API. A general MapReduce API that encompasses all of the serialization frameworks and does not lock users into a particular one is much more powerful.
>
> Furthermore, making it convenient for the users, by including the plugins in the default configuration and class path, will enable the use of Avro, Thrift and ProtoBuf objects by people who would rather not focus on serialization. Avro and Writables should not be the only first class serializations that Hadoop supports by default.
>
> 3. Should a framework dependency on ProtoBuf be allowed?
>
> Doug has added several framework dependences on Avro. The question is whether it is acceptable to use the ProtoBuf library in the framework. Avro is good for uses where there are a lot of objects of the same type. ProtoBuf is better for small number of objects. The question is whether Avro, JSON, and XML should be the only serialization libraries that are acceptable to use in the framework.
Re: [VOTE] Direction for Hadoop development
Chris Douglas 2010-12-06, 18:40
SequenceFile is a widely used file format; as long as it continues to support existing data, I see no reason to block its continued evolution. TFile is also part of the common project. The project could support still more, provided the tools and documentation were available to help users select the one that fits their use case.

This question is backwards. If the assertion is that a part of the framework's development should be arrested, that claim requires a discussion and vote. The PMC should not have to weigh in on allowing code to change. -C

On Mon, Dec 6, 2010 at 9:16 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Dec 1, 2010, at 11:11 AM, Owen O'Malley wrote:
>
> All,
> We really need some guidance on the general direction for the project. Please comment and/or vote. If no one cares, then I'll probably commit it to Yahoo's internal branch.
>
> -- Owen
>
>> The question is how the Hadoop project wants to move forward.
>>
>> It was motivated by Doug's veto of HADOOP-6685, which was based on his personal decisions about how the project should go forward and not on anything that had been decided by the PMC.
>>
>> These decisions are much more important to MapReduce, which is a framework, than HDFS which is a client/server model.
>>
>> 1. Should Hadoop include a user-facing library of useful code?
>>
>> There has been a suggestion that user-facing library code, such as SequenceFile, TFile, DistCp, etc. should be deprecated and that Hadoop should allow third party projects like Avro to supply the user-facing library code that makes Hadoop usable. I think it is critical that we keep those components as part of Hadoop and extend them as the framework evolves. Users depend heavily on SequenceFile for storing their data in Hadoop and they should not be deprecated as Doug has suggested.
>>
>> 2. Should MapReduce support non-Writables through the pipeline out of the box?
>>
>> There has also been a discussion about whether we should support non-Writables natively. There is already library code in Avro that lets users use Avro types in a custom MapReduce API. A general MapReduce API that encompasses all of the serialization frameworks and does not lock users into a particular one is much more powerful.
>>
>> Furthermore, making it convenient for the users, by including the plugins in the default configuration and class path, will enable the use of Avro, Thrift and ProtoBuf objects by people who would rather not focus on serialization. Avro and Writables should not be the only first class serializations that Hadoop supports by default.
>>
>> 3. Should a framework dependency on ProtoBuf be allowed?
>>
>> Doug has added several framework dependences on Avro. The question is whether it is acceptable to use the ProtoBuf library in the framework. Avro is good for uses where there are a lot of objects of the same type. ProtoBuf is better for small number of objects. The question is whether Avro, JSON, and XML should be the only serialization libraries that are acceptable to use in the framework.
Re: [VOTE] Direction for Hadoop development
Arun C Murthy 2010-12-06, 18:46
On Dec 6, 2010, at 10:40 AM, Chris Douglas wrote:

> This question is backwards. If the assertion is that a part of the framework's development should be arrested, that claim requires a discussion and vote. The PMC should not have to weigh in on allowing code to change. -C

Agreed.

Arresting development on SequenceFile is preposterous. There are several petabytes of data sitting on it all over for several reasons, including legacy. Stopping development on it is unreasonable. Apache Hadoop is volunteer driven, volunteers should be allowed to contribute as they see fit.

+1

Arun

> On Mon, Dec 6, 2010 at 9:16 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>>
>> On Dec 1, 2010, at 11:11 AM, Owen O'Malley wrote:
>>
>> All,
>> We really need some guidance on the general direction for the project. Please comment and/or vote. If no one cares, then I'll probably commit it to Yahoo's internal branch.
>>
>> -- Owen
>>
>>> The question is how the Hadoop project wants to move forward.
>>>
>>> It was motivated by Doug's veto of HADOOP-6685, which was based on his personal decisions about how the project should go forward and not on anything that had been decided by the PMC.
>>>
>>> These decisions are much more important to MapReduce, which is a framework, than HDFS which is a client/server model.
>>>
>>> 1. Should Hadoop include a user-facing library of useful code?
>>>
>>> There has been a suggestion that user-facing library code, such as SequenceFile, TFile, DistCp, etc. should be deprecated and that Hadoop should allow third party projects like Avro to supply the user-facing library code that makes Hadoop usable. I think it is critical that we keep those components as part of Hadoop and extend them as the framework evolves. Users depend heavily on SequenceFile for storing their data in Hadoop and they should not be deprecated as Doug has suggested.
>>>
>>> 2. Should MapReduce support non-Writables through the pipeline out of the box?
>>>
>>> There has also been a discussion about whether we should support non-Writables natively. There is already library code in Avro that lets users use Avro types in a custom MapReduce API. A general MapReduce API that encompasses all of the serialization frameworks and does not lock users into a particular one is much more powerful.
>>>
>>> Furthermore, making it convenient for the users, by including the plugins in the default configuration and class path, will enable the use of Avro, Thrift and ProtoBuf objects by people who would rather not focus on serialization. Avro and Writables should not be the only first class serializations that Hadoop supports by default.
>>>
>>> 3. Should a framework dependency on ProtoBuf be allowed?
>>>
>>> Doug has added several framework dependences on Avro. The question is whether it is acceptable to use the ProtoBuf library in the framework. Avro is good for uses where there are a lot of objects of the same type. ProtoBuf is better for small number of objects. The question is whether Avro, JSON, and XML should be the only serialization libraries that are acceptable to use in the framework.
Re: [VOTE] Direction for Hadoop development
Tom White 2010-12-06, 21:14
On Mon, Dec 6, 2010 at 9:16 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> All,
> We really need some guidance on the general direction for the project. Please comment and/or vote. If no one cares, then I'll probably commit it to Yahoo's internal branch.

An alternative to this would be to create a project for serializations. Other projects have successfully provided external libraries for Hadoop, e.g. Pig, Hive, Elephant Bird, Plume, Cascading, Mahout etc. This would make it possible for users to choose which library they wanted to use, and allows updates to the library to be driven by the library's release schedule, not Hadoop's.

Tom
Re: [VOTE] Direction for Hadoop development
Konstantin Shvachko 2010-12-07, 11:27
It really takes time to understand the issue. I will spend more time reading through it.

So far I feel that we need to distinguish between
a) issues that define the general direction for the project, and
b) the specifics of the implementation proposed by Owen, including decisions induced by that implementation.

The main contradictory issue on which Owen and Doug disagree (other people as well) is whether Hadoop should support multiple serializations or be based on one designated serialization. This is a defining general direction a-issue. I believe this is vote-able.

The question of introducing dependency on ProtoBuf is a b-issue, as it can be implemented differently. Say with "pluggable" APIs as Tom proposed. This is probably a consensus-type issue.

Looks to me if we decide on multiple vs designated serializations, some b-issues may be automatically ruled out or in.

Thanks,
--Konstantin

P.S. We used to have a tradition of presenting design documents before introducing such big changes. I believe a discussion of a design doc would have reduced tensions we face now.

On Mon, Dec 6, 2010 at 9:16 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Dec 1, 2010, at 11:11 AM, Owen O'Malley wrote:
>
> All,
> We really need some guidance on the general direction for the project. Please comment and/or vote. If no one cares, then I'll probably commit it to Yahoo's internal branch.
>
> -- Owen
>
>> The question is how the Hadoop project wants to move forward.
>>
>> It was motivated by Doug's veto of HADOOP-6685, which was based on his personal decisions about how the project should go forward and not on anything that had been decided by the PMC.
>>
>> These decisions are much more important to MapReduce, which is a framework, than HDFS which is a client/server model.
>>
>> 1. Should Hadoop include a user-facing library of useful code?
>>
>> There has been a suggestion that user-facing library code, such as SequenceFile, TFile, DistCp, etc. should be deprecated and that Hadoop should allow third party projects like Avro to supply the user-facing library code that makes Hadoop usable. I think it is critical that we keep those components as part of Hadoop and extend them as the framework evolves. Users depend heavily on SequenceFile for storing their data in Hadoop and they should not be deprecated as Doug has suggested.
>>
>> 2. Should MapReduce support non-Writables through the pipeline out of the box?
>>
>> There has also been a discussion about whether we should support non-Writables natively. There is already library code in Avro that lets users use Avro types in a custom MapReduce API. A general MapReduce API that encompasses all of the serialization frameworks and does not lock users into a particular one is much more powerful.
>>
>> Furthermore, making it convenient for the users, by including the plugins in the default configuration and class path, will enable the use of Avro, Thrift and ProtoBuf objects by people who would rather not focus on serialization. Avro and Writables should not be the only first class serializations that Hadoop supports by default.
>>
>> 3. Should a framework dependency on ProtoBuf be allowed?
>>
>> Doug has added several framework dependences on Avro. The question is whether it is acceptable to use the ProtoBuf library in the framework. Avro is good for uses where there are a lot of objects of the same type. ProtoBuf is better for small number of objects. The question is whether Avro, JSON, and XML should be the only serialization libraries that are acceptable to use in the framework.
Re: [VOTE] Direction for Hadoop development
Doug Cutting 2010-12-07, 17:22

On 12/07/2010 03:27 AM, Konstantin Shvachko wrote:
> The main contradictory issue on which Owen and Doug disagree (other people as well) is whether Hadoop should support multiple serializations or be based on one designated serialization.
> This is a defining general direction a-issue. I believe this is vote-able.

I no longer think we should add any new serialization implementations to the kernel. We might provide implementations as separate libraries that folks choose to use, but we should work to make sure that user code is well distinguished from the kernel and also try not to pollute the user's classpath with particular versions of popular libraries.

Doug

Re: [VOTE] Direction for Hadoop development
Konstantin Shvachko 2010-12-07, 18:26

> I no longer think we should add any new serialization implementations to the kernel.

Not clear. Do you propose to keep current serialization(s) and not add new ones? Or do you propose to replace current serialization by abstract interfaces and move implementations to libraries?

--Konstantin

On Tue, Dec 7, 2010 at 9:22 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> On 12/07/2010 03:27 AM, Konstantin Shvachko wrote:
>
>> The main contradictory issue on which Owen and Doug disagree (other people as well) is whether Hadoop should support multiple serializations or be based on one designated serialization.
>> This is a defining general direction a-issue. I believe this is vote-able.
>
> I no longer think we should add any new serialization implementations to the kernel. We might provide implementations as separate libraries that folks choose to use, but we should work to make sure that user code is well distinguished from the kernel and also try not to pollute the user's classpath with particular versions of popular libraries.
>
> Doug

Re: [VOTE] Direction for Hadoop development
Doug Cutting 2010-12-08, 18:55

On 12/07/2010 10:26 AM, Konstantin Shvachko wrote:
>> I no longer think we should add any new serialization implementations to the kernel.
>
> Not clear. Do you propose to keep current serialization(s) and not add new ones?
> Or do you propose to replace current serialization by abstract interfaces and move implementations to libraries?

We can't move existing serialization implementations to an optional library without breaking compatibility. Long-term that might be nice, but I am not proposing that short-term.

Short-term I propose we avoid adding new serialization implementations to the default classpath, especially those that add new dependencies to every task.

Long-term we might split library code into perhaps a few categories:

- mandatory: this might include, e.g., IdentityMapper and IdentityReducer, the default implementations.
- back-compatible: the collection of library components that were provided on the default classpath and can be enabled for back-compatible behavior.
- optional: components that jobs can optionally depend on. This is where new components that are not mandatory would be added.

Doug

Re: [VOTE] Direction for Hadoop development
Steve Loughran 2010-12-01, 12:25

On 29/11/10 22:30, Owen O'Malley wrote:
> All,
> Based on the discussion on HADOOP-6685, there is a pretty fundamental difference of opinion about how Hadoop should evolve. We need to figure out how the majority of the PMC wants the project to evolve to understand which patches move us forward. Please vote whether you approve of the following direction. Clearly as the author, I'm +1.
>
> -- Owen
>
> Hadoop has always included library code so that users had a strong foundation to build their applications on without needing to continually reinvent the wheel. This combination of framework and powerful library code is a common pattern for successful projects, such as Java, Lucene, etc. Toward that end, we need to continue to extend the Hadoop library code and actively maintain it as the framework evolves. Continuing support for SequenceFile and TFile, which are both widely used is mandatory. The opposite pattern of implementing the framework and letting each distribution add the required libraries will lead to increased community fragmentation and vendor lock in.
>
> Hadoop's generic serialization framework had a lot of promise when it was introduced, but has been hampered by a lack of plugins other than Writables and Java serialization. Supporting a wide range of serializations natively in Hadoop will give the users new capabilities. Currently, to support Avro or ProtoBuf objects mutually incompatible third party solutions are required. It benefits Hadoop to support them with a common framework that will support all of them. In particular, having easy, out of the box support for Thrift, ProtoBufs, Avro, and our legacy serializations is a desired state.
>
> As a distributed system, there are many instances where Hadoop needs to serialize data. Many of those applications need a lightweight, versioned serialization framework like ProtocolBuffers or Thrift and using them is appropriate. Adding dependences on Thrift and ProtocolBuffers to the previous dependence on Avro is acceptable.

I'm happy with new build-time dependencies on these libraries, with one big warning. Until an official, non-incubation release of Thrift comes out (and thrift moves from incubation), the Apache Management will veto any redistribution of the thrift JARs; they aren't signed off for public use.

I'm not so sure about more runtime dependencies that go all the way into the classpath of the things working with HDFS, or files created in it, because that leads to version problems in private code. [Inevitably Hadoop will end up adopting some OSGi-like classpath setup, but I'm not pushing for that as it has its own interesting issues].

At the same time -you can't add features without adding dependencies except by playing rebasing tricks, and I have mixed feelings about those tricks:

good: lets the hadoop team push things out on their schedule
bad: impossible to push out security bug fixes to dependent libraries without rebuilding and re-releasing things. Your ops team will hate you.

For the bad reason, and because it's extra work, I avoid playing rebasing games, just try and do classpaths right in the first place -which is easier said than done.

One part of the HADOOP-6685 discussion raised was JSON as a format for things. Adopting JSON -and deciding which JSON parser to use- is trouble. Ignoring the ongoing discussion of serialization formats, the question "should we use JSON?" really leads back to "which external JSON parser do we want to use?", which is a separate -and significant- problem.

I say this as someone who has three separate json parsers on the runtime classpath of something whose functional tests are failing in a hudson window blinking at me alongside this email application.

gson:
  http://code.google.com/p/google-gson/
  http://mvnrepository.com/artifact/com.google.code.gson/gson/1.4
  com.google.code.gson/gson-1.5.1; no runtime dependencies
  -some people like the seamless binding to java objects, which I view as repeating the same mistakes as WS-*.

json-lib:
  http://json-lib.sourceforge.net/
  http://mvnrepository.com/artifact/net.sf.json-lib/json-lib/2.3
  at runtime tends to need the usual commons-logging back end and
    net.sf.json-lib/json-lib-2.3
    net.sf.ezmorph/ezmorph-1.06
    commons-lang-2.4
    commons-collections-3.2.1
  -low level, DOM-ish, could be improved to be more Java-5-intuitive

Jackson:
  http://jackson.codehaus.org
  org.codehaus.jackson/jackson-core-asl-1.6.2
  org.codehaus.jackson/jackson-asl/0.9.5

Now, before someone points out that three JSON parsers is too many, this same code has log4J, SLF4J (with a back end to JSCL), a patched back end logger for Jetty to avoid SLF4J where possible, and a custom JCL back-end. XML side there's xerces and xalan instead of the JVM versions, and hibernate pulling in dom4j alongside. Test runs add htmlunit to the classpath, which pulls in the older httpclient libs, along with the http-core stuff I've switched to.

Java library versions -while more manageable than native library versions- are a pain. Regardless of the ugliness of XML or the mediocrity of DOM, running over to JSON just because DOM is unwieldy is replacing one source of trouble with another. If Hadoop is going to use JSON in places, then the discussion/decision about which JSON parser to stick on the classpath is worthy of a JIRA issue all of its own.

-steve

(returning to his failing tests)

Re: [VOTE] Direction for Hadoop development
Eric Sammer 2010-12-07, 03:36

I'm going to rather purposefully ignore larger questions like how the ASF works or doesn't, veto usage, etc. I'm not well versed enough in the Apache way to weigh in.

As someone who sees a lot of Hadoop clusters at many different companies, I would like to see Hadoop's serialization system(s) change. I think Hadoop should support interfaces to control serialization plugin lifecycle and element serialization to / from an abstract notion of a datum and bytes only. I would like to not mention a serialization implementation by name in Hadoop proper, at all. A single implementation to serve as a reference implementation makes sense. To preserve backward compatibility and existing investment, it makes sense for that to be Writable (whether we like it or not). Additional implementations should be either "contrib" status (if that's still an option) or externally managed (probably preferred due to release cycle synchronization / update issues). The default classpath should remain as free of mandatory external dependencies as possible; library conflicts are still an extremely sore spot in Java development at many sites I visit and forcing a large commercial entity to use version X of something like Avro, Thrift, PB is almost a non-starter for many.

If a PB / Thrift / Avro serialization implementation is part of contrib or externally managed, it requires the user to understand this dependency exists and manage the classpath. The precedent in my mind is the scheduler situation; most folks run with either the cap or fair schedulers but FIFO provides a default. If you opt to use one and it comes with dependencies, that's your business. I think we can simplify serialization plugin configuration via a classpath include system by using something like run-parts or similar and the current configuration system, but that's another issue.

In absence of an "opt in" serialization configuration pattern, we must at least provide an "opt out." If a user uses thrift for their own MR jobs internally, we shouldn't throw a monkey wrench into their life by demanding it for core Hadoop. Provide them a means to de-configure built in serialization impls and remove thrift from the classpath.

I'm a bit confused as to how this equates with sequence files being deprecated or arrested. I tried to read HADOOP-6685 but there's a lot of internal references and context I feel like I'm missing. Suffice it to say, sequence files can *not* be broken for existing data for the reasons everyone has stated. If we choose to focus development elsewhere ("soft deprecate") or actively encourage users elsewhere ("@Deprecated") is an issue I think we can sever from this discussion.

tl;dr version:

- Don't break existing SequenceFiles.
- Serialization should be a richer interface to support plugin lifecycle, serialize / deserialize only and be retrofitted using PB, Avro, and Thrift as immediate consumer use cases. Serialization APIs should be promoted to a(n officially) public, documented, API suited to deal with modern serialization lib requirements.
- Common, HDFS, MR should contain as few mandatory external deps as humanly possible because Java classloader semantics and a lack of internal dep isolation is just kookoo for cocoa puffs. (Simplify it and bring on our OSGI overlords.)
- We (non-committers / users / casual contributors) want only for Hadoop to mature in features and stability, be an inviting community to new potential contributors and users, and to be around for a long time.

Regards, respect, and thanks to all.

On Mon, Nov 29, 2010 at 5:30 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> All,
> Based on the discussion on HADOOP-6685, there is a pretty fundamental difference of opinion about how Hadoop should evolve. We need to figure out how the majority of the PMC wants the project to evolve to understand which patches move us forward. Please vote whether you approve of the following direction. Clearly as the author, I'm +1.
>
> -- Owen
>
> Hadoop has always included library code so that users had a strong

Eric Sammer
twitter: esammer
data: www.cloudera.com

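As a rough illustration of the kind of interface described in the message above (plugin lifecycle plus serialize / deserialize between a datum and bytes, with one reference implementation), here is a minimal Java sketch. The names `SerializationPlugin`, `Utf8StringPlugin`, `open` and `close` are hypothetical; they are not taken from Hadoop or from the HADOOP-6685 patch, and the string implementation only stands in for a Writable-style default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

/** Hypothetical plugin contract: lifecycle hooks plus datum <-> bytes conversion only. */
interface SerializationPlugin<T> {
    void open(Map<String, String> conf);   // lifecycle: called once before first use
    byte[] serialize(T datum);             // datum -> bytes
    T deserialize(byte[] bytes);           // bytes -> datum
    void close();                          // lifecycle: release any resources
}

/** Toy reference implementation for strings (an assumption for illustration). */
final class Utf8StringPlugin implements SerializationPlugin<String> {
    private boolean opened;

    public void open(Map<String, String> conf) { opened = true; }

    public byte[] serialize(String datum) {
        requireOpen();
        return datum.getBytes(StandardCharsets.UTF_8);
    }

    public String deserialize(byte[] bytes) {
        requireOpen();
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public void close() { opened = false; }

    private void requireOpen() {
        if (!opened) throw new IllegalStateException("plugin not open");
    }
}

final class PluginDemo {
    public static void main(String[] args) {
        SerializationPlugin<String> plugin = new Utf8StringPlugin();
        plugin.open(Collections.emptyMap());
        byte[] bytes = plugin.serialize("some datum");
        System.out.println(plugin.deserialize(bytes));
        plugin.close();
    }
}
```

Keeping the contract to bytes in and bytes out means the caller never has to name a concrete serialization, which is the property the message asks for.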
Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-07, 08:13

On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:

Eric,
Since this is mostly technical, it probably should be on the h-6685 jira instead of general@hadoop.

> I think Hadoop should support interfaces to control serialization plugin lifecycle and element serialization to / from an abstract notion of a datum and bytes only.

The core of my h-6685 patch updates the API to replace the typename with a serialization name and serialization specific metadata. That metadata is a set of bytes that are defined by the serialization. The typename alone is insufficient for Avro and having additional metadata will be useful for the other serializations as well.

Doug suggested that I add a user-friendly pair of methods and I did. While they are redundant, the set of serializations isn't expected to be large and therefore the extra code isn't much.

> I would like to not mention a serialization implementation by name in Hadoop proper, at all.

My patch removes some of the lingering references to Writables in SequenceFile, MapFile, etc. and moves them over to the generic serialization API. The framework will likely continue to depend on whichever serialization is used for RPC. Currently that is Writables, but will likely transition to either Avro or ProtoBuf in the coming year.

> A single implementation to serve as a reference implementation makes sense.

A critical part of Hadoop's usability comes from its framework combined with library code that allows users to get the desired functionality without writing it themselves. Sure, it is easy to write a hash table yourself, but it is far easier to use the one bundled with Java.

> The default classpath should remain as free of mandatory external dependencies as possible; library conflicts are still an extremely sore spot in Java development at many sites I visit and forcing a large commercial entity to use version X o something like Avro, Thrift, PB is almost a non-starter for many.

I discussed this problem in the jira, but either the MapReduce user is using the X library or doesn't care about the version of X. If they are using it, it is far more convenient to have the serialization on the classpath. There is a missing feature that we need to address to put the user's files ahead of the system ones. I'll file a jira for that. It might also make sense for us to shade some of our dependencies, but that is a much bigger issue and is far from clear cut.

> If a PB / Thrift / Avro serialization implementation is part of contrib or externally managed, it requires the user to understand this dependency exists and manage the classpath.

The goal is to make Hadoop useful out of the box. If we make it so that Hadoop is only useful once it is bundled with 15 other projects, that is good for people who sell distributions that include Hadoop, but not for the project.

> I think we can simplify serialization plugin configuration via a classpath include system by using something like run-parts or similar and the current configuration system, but that's another issue.

The current patch loads the serialization plugins based on the configuration. If you don't want to support thrift, don't configure it. The same holds true of the other serializations, even writable.

> I'm a bit confused as to how this equates with sequence files being deprecated or arrested.

Doug vetoed my patch partially based on his assertion that SequenceFiles should be deprecated and that Hadoop should just be the framework with no library code.

> If we choose to focus development elsewhere ("soft deprecate") or actively encourage users elsewhere ("@Deprecated") is an issue I think we can sever from this discussion.

At this point the PMC has supported continuing to invest in developing SequenceFiles.

> - Don't break existing SequenceFiles.

That goes without saying, everyone has petabytes of data in them.

> - Common, HDFS, MR should contain as few mandatory external deps as

That is a much bigger discussion that we should probably have. There are costs on both sides in terms of debugging and understandability. In particular, in most cases we are much better off using a library that has the right functionality than re-implementing it ourselves.

I want that too.

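The configuration-driven loading mentioned in the message above ("If you don't want to support thrift, don't configure it") might look roughly like the following plain-Java sketch. It is not the patch's code: the property key `example.serializations` and the `Codec`, `WritableCodec`, `ThriftCodec` and `CodecLoader` names are made up for illustration, and the loader simply instantiates whatever class names the configuration lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for a serialization plugin. */
interface Codec {
    String name();
}

final class WritableCodec implements Codec {
    public String name() { return "writable"; }
}

final class ThriftCodec implements Codec {
    public String name() { return "thrift"; }
}

final class CodecLoader {
    /** Hypothetical configuration key holding a comma-separated list of class names. */
    static final String KEY = "example.serializations";

    /** Instantiates only the codecs named in the configuration; unlisted ones are never loaded. */
    static List<Codec> load(Properties conf) {
        List<Codec> codecs = new ArrayList<>();
        for (String className : conf.getProperty(KEY, "").split(",")) {
            String trimmed = className.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing configured in this slot
            }
            try {
                codecs.add((Codec) Class.forName(trimmed).getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot load serialization " + trimmed, e);
            }
        }
        return codecs;
    }
}

final class CodecLoaderDemo {
    public static void main(String[] args) {
        Properties conf = new Properties();
        // Thrift is deliberately left out, so ThriftCodec is never instantiated.
        conf.setProperty(CodecLoader.KEY, WritableCodec.class.getName());
        for (Codec codec : CodecLoader.load(conf)) {
            System.out.println("loaded: " + codec.name());
        }
    }
}
```

Note that this only covers which plugins get instantiated. Whether the corresponding JARs sit on the default classpath is the separate question debated in this thread.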
Re: [VOTE] Direction for Hadoop development
Jeff Hammerbacher 2010-12-07, 10:23

> A critical part of Hadoop's usability comes from its framework combined with library code that allows users to get the desired functionality without writing it themselves.
>
> The goal is to make Hadoop useful out of the box.

To the best of my knowledge, Owen, your organization requires users to petition a committee before writing MapReduce jobs. At Facebook, the vast majority of jobs are submitted via Hive. Our customers at Cloudera primarily consume MapReduce through Pig, Hive, and other high-level tools.

Users of Hadoop have moved beyond MapReduce. The community would be far better served by a compact, reliable, and efficient kernel. That's the project direction Doug has suggested for MapReduce, and it's one that Eric and Tom have supported. I also support this direction for the project.

We're clearly having a hard time, as a community, agreeing on standards for library code. We've also shipped updates to the framework without updating the library code, seriously damaging the usability of the project. In this discussion, we're prioritizing the rapidly shrinking proportion of users of MapReduce library code over the far larger community of consumers of the framework.

Arun recently asked on Quora about issues that users face with Hadoop MapReduce: http://qr.ae/pPNK. There are currently five issues brought up there, with 19 votes for those issues; none of them are addressed directly by this extended debate.

I'd be ecstatic to see this discussion result in moving the file formats, input and output formats, and other library code out to a separate Apache project or Github where they can evolve rapidly based on user needs, so that the MapReduce project can begin to address some of the outstanding issues with the framework itself.

HDFS, HBase, Hive, Pig, Oozie, and other Hadoop-related projects continue to make forward progress at a remarkable rate; I'd like to see MapReduce return to health as well. Clearing away these major sources of conflict seems like one promising path forward.

So, I'm not on the PMC, but I'm -1 on the proposed vote.

Re: [VOTE] Direction for Hadoop development
Arun C Murthy 2010-12-07, 16:12

On Dec 7, 2010, at 2:23 AM, Jeff Hammerbacher wrote:

> To the best of my knowledge, Owen, your organization requires users to petition a committee before writing MapReduce jobs.

I'd appreciate if we could keep the discussion technical and did not resort to snide comments. Thank you.

> Users of Hadoop have moved beyond MapReduce. The community would be far better served by a compact, reliable, and efficient kernel. That's the project direction Doug has suggested for MapReduce, and it's one that Eric and Tom have supported. I also support this direction for the project.

This is a great discussion to have, if Doug could start it, rather than put forward his word as the law.

However, this is not germane to the discussion at hand.

The discussion at hand is simple: Doug has vetoed this patch for 2 reasons:
a) dependency on PB
b) extension to SequenceFile

a) is technical, b) isn't. This discussion is about b).

> I'd be ecstatic to see this discussion result in moving the file formats, input and output formats, and other library code out to a separate Apache project or Github where they can evolve rapidly based on user needs, so that the MapReduce project can begin to address some of the outstanding issues with the framework itself.

Again, no one is proposing new file formats here.

SequenceFile is an important file format for several reasons:
- It's been bundled with Hadoop for nearly 5 years now
- Several users store petabytes of data on it

Blocking extensions to SequenceFile is unreasonable as has been noted by several folks, there is no *technical* reason to do that.

People are welcome to start any number of file-formats and input/output libraries either in Apache or outside, no one is proposing otherwise.

Arun

Re: [VOTE] Direction for Hadoop development
Doug Cutting 2010-12-07, 17:26

On 12/07/2010 08:12 AM, Arun C Murthy wrote:
> Blocking extensions to SequenceFile is unreasonable as has been noted by several folks, there is no *technical* reason to do that.

The change to SequenceFile is incompatible with older versions of Hadoop. It changes the file's version number so that older versions will not be able to read data written by newer versions. This is a technical issue.

Doug

Re: [VOTE] Direction for Hadoop development
Owen O'Malley 2010-12-07, 18:25

On Dec 7, 2010, at 9:26 AM, Doug Cutting wrote:

> On 12/07/2010 08:12 AM, Arun C Murthy wrote:
>> Blocking extensions to SequenceFile is unreasonable as has been noted by several folks, there is no *technical* reason to do that.
>
> The change to SequenceFile is incompatible with older versions of Hadoop. It changes the file's version number so that older versions will not be able to read data written by newer versions. This is a technical issue.

The new code reads the new or old versions of SequenceFile seamlessly using auto-detection of the version. The old code fails with an explicit message saying that it can't read this version. This is the only mechanism available when upgrading a file format with a single version number and is the mechanism that we've used 6 times in the past.

If we'd used ProtocolBuffers for the SequenceFile header, we'd have more options for backwards compatibility, but we didn't.

-- Owen

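The single-version-number mechanism described in the message above can be sketched as follows. This is an illustrative stand-alone check, not SequenceFile's actual reader code: `HeaderVersion` is a hypothetical helper, the header is modeled as three magic bytes followed by one version byte, and the version numbers 6 and 7 are placeholders chosen for the example.

```java
import java.util.Arrays;

final class HeaderVersion {
    /** Magic bytes assumed to open the file, followed by a single version byte. */
    static final byte[] MAGIC = {'S', 'E', 'Q'};

    /**
     * Returns the file's version if this reader can handle it.
     * A reader accepts every version up to maxSupported (so newer code reads
     * old and new files) and fails with an explicit message on anything newer.
     */
    static int check(byte[] header, int maxSupported) {
        if (header.length < 4 || !Arrays.equals(Arrays.copyOf(header, 3), MAGIC)) {
            throw new IllegalArgumentException("not a recognized file header");
        }
        int version = header[3];
        if (version > maxSupported) {
            throw new IllegalStateException(
                "file version " + version + " is newer than supported version " + maxSupported);
        }
        return version;
    }
}

final class HeaderVersionDemo {
    public static void main(String[] args) {
        byte[] oldFile = {'S', 'E', 'Q', 6}; // placeholder "old" version
        byte[] newFile = {'S', 'E', 'Q', 7}; // placeholder "new" version

        // New reader: auto-detects and accepts both.
        System.out.println(HeaderVersion.check(oldFile, 7));
        System.out.println(HeaderVersion.check(newFile, 7));

        // Old reader: rejects the new file with an explicit message.
        try {
            HeaderVersion.check(newFile, 6);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch shows both sides of the disagreement: forward reading works cleanly, but a file written with the newer version is unreadable by older code.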
Re: [VOTE] Direction for Hadoop development
Doug Cutting 2010-12-08, 19:20

On 12/07/2010 10:25 AM, Owen O'Malley wrote:
> The new code reads the new or old versions of SequenceFile seamlessly using auto-detection of the version. The old code fails with an explicit message saying that it can't read this version. This is the only mechanism available when upgrading a file format with a single version number and is the mechanism that we've used 6 times in the past.

The last such change was nearly four years ago, in:

https://issues.apache.org/jira/browse/HADOOP-732

The quantity of data stored in SequenceFiles has greatly increased over the past four years. The project's concern for compatibility has also correspondingly increased over that time.

The new format version might not be written when folks are using Writable or some other serialization currently supported by SequenceFile. The only situation in your patch where the new version is required is for Avro. You might simply drop support for Avro and leave the file version number alone since Avro already includes a container file format. Or you might only use the new format version for non-class-determined serializations like Avro. Or you might use SequenceFile's existing metadata for non-class-determined serializations like Avro and leave the file version number alone.

Doug

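One of the alternatives suggested in the message above, writing the new format version only for non-class-determined serializations, could be sketched like this. The `SerializationKind` and `VersionChooser` names and the version numbers are assumptions for illustration only; nothing here is taken from the patch.

```java
/** Whether the Java class name alone is enough to read the data back. */
enum SerializationKind {
    WRITABLE(true),  // class-determined: the type name is sufficient
    AVRO(false);     // not class-determined: extra metadata (a schema) is needed

    final boolean classDetermined;

    SerializationKind(boolean classDetermined) {
        this.classDetermined = classDetermined;
    }
}

final class VersionChooser {
    static final int LEGACY_VERSION = 6;   // placeholder for the existing format version
    static final int EXTENDED_VERSION = 7; // placeholder for the proposed new version

    /** Writes the legacy version whenever possible so older readers keep working. */
    static int versionFor(SerializationKind kind) {
        return kind.classDetermined ? LEGACY_VERSION : EXTENDED_VERSION;
    }
}

final class VersionChooserDemo {
    public static void main(String[] args) {
        for (SerializationKind kind : SerializationKind.values()) {
            System.out.println(kind + " -> version " + VersionChooser.versionFor(kind));
        }
    }
}
```

Under this scheme, files written with existing serializations stay readable by older releases, and only files that need the extra metadata carry the new version number.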
Re: [VOTE] Direction for Hadoop development
Eric Sammer 2010-12-07, 18:08

Thanks for your response Owen. I'll comment on the JIRA with my opinion. I didn't want to muddy the existing conversation, but if it helps to have user level input, I'm happy to throw my hat in the ring.

Just the summary version this time:

Non-technical:

- I believe we need to temper our goals of stability with the need for growth and improvement. The project should be free to innovate. We all agree on this. How we do that is the question to me. We should take a (brief) step back to make a decision on that.
- We should reevaluate how most people view and are using Hadoop to help us make these decisions. For instance, do people see Hadoop as a turn key system that includes everything required or do they view it as a framework for building custom data systems? What I've seen and believe is that it's more the latter and having some "after market" customization is normal. The community / ASF spinning off projects like Pig, Hive, ZK, Chukwa, and others reinforce this in my mind; these are not bits of Hadoop proper, but natural extensions with their own development path and release schedules.
- No one benefits from Hadoop being difficult for people to use including those of us at Cloudera[1]. I don't want anyone to see us as wanting to create complexity. We all benefit from a healthy Hadoop community.

Technical:

- Any modification to SequenceFile (and friends) worries me as so much is tightly bound to them. This is something I think is an artifact of people coding to the implementation rather than the interface, so to speak.
- Generally, I agree with a lot of Owen's motivation (e.g. codifying the serialization system, using multiple libs to prove it's proper) but some of the implementation can be more forgiving of some usage patterns in the wild (e.g. the conflicting dep version issues, whether future dev on some of these file formats should be extracted from Hadoop proper).

Proposal:

- Codify (by vote) whether design plans are required or if an informal email indicating intent is sufficient, and under what circumstances. Provide examples to clarify circumstances. Solves the long term but not HADOOP-6685.
- Focus the discussion on evaluation of proposals for remedying the process for conflict resolution. I know some exist, but they're drastic (removal of PMC members, for instance).
- After consensus on above, focus the conversation (in another thread or on JIRA, whatever is most appropriate) on HADOOP-6685 so no one is blocked.
- Put the community of users first in all areas of development and interaction.

[1] I am officially speaking out of school. I am not an official spokesperson for Cloudera. This is my opinion and I happen to work at Cloudera.

On Tue, Dec 7, 2010 at 3:13 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:
>
> Eric,
> Since this is mostly technical, it probably should be on the h-6685 jira instead of general@hadoop.
>
>> I think Hadoop should support interfaces to control serialization plugin lifecycle and element serialization to / from an abstract notion of a datum and bytes only.
>
> The core of my h-6685 patch updates the API to replace the typename with a serialization name and serialization specific metadata. That metadata is a set of bytes that are defined by the serialization. The typename alone is insufficient for Avro and having additional metadata will be useful for the other serializations as well.
>
> Doug suggested that I add a user-friendly pair of methods and I did. While they are redundant, the set of serializations isn't expected to be large and therefore the extra code isn't much.
>
>> I would like to not mention a serialization implementation by name in Hadoop proper, at all.
>
> My patch removes some of the lingering references to Writables in SequenceFile, MapFile, etc. and moves them over to the generic serialization API. The framework will likely continue to depend on whichever serialization is used for RPC. Currently that is Writables, but will likely transition to

Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: [VOTE] Direction for Hadoop development
Arun C Murthy 2010-12-07, 15:55

On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:

> I'm a bit confused as to how this equates with sequence files being deprecated or arrested. I tried to read HADOOP-6685 but there's a lot of internal references and context I feel like I'm missing. Suffice it to say, sequence files can *not* be broken for existing data for the reasons everyone has stated. If we choose to focus development elsewhere ("soft deprecate") or actively encourage users elsewhere ("@Deprecated") is an issue I think we can sever from this discussion.

I'm surprised that you are confused.

http://s.apache.org/h6685-veto

Doug is very clear that he is vetoing the patch based on 2 reasons:
a) dependency on PB
b) extension to SequenceFile

a) is technical, and we can debate about it.

b) isn't. It's his 'vision' for the project, a vision which hasn't been ratified by the PMC.

Arun

Re: [VOTE] Direction for Hadoop development
Jay Booth 2010-12-07, 16:06

a) On the PB dependency.. can't we just use JSON and call it a day? I mean, we're gonna have a new dependency so that we can encode a single tuple? That doesn't even make engineering sense, let alone that the choice of PB looks like a deliberate decision to try and tweak Doug's nose, whether that was the intention or not. Even if you could make a case for some very minor benefit of using PB instead of one of the 3 serialization methods already on the classpath, it's hard to see why it's worth going to the mat over it. And again, as a user, every additional classpath element in Hadoop is a potential future conflict that I'll have to sort out for some non-exciting business process I'm writing.

b) Agreeing with Eric.. backwards compatibility is essential for sequence file. It seems to me that past a certain point, it's easier to just make a new file format rather than cramming further functionality and backwards-compatibility layers into the SequenceFile class, but as long as it's backward compatible then I'm sure people will be fine.

On Tue, Dec 7, 2010 at 10:55 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:
>
>> I'm a bit confused as to how this equates with sequence files being deprecated or arrested. I tried to read HADOOP-6685 but there's a lot of internal references and context I feel like I'm missing. Suffice it to say, sequence files can *not* be broken for existing data for the reasons everyone has stated. If we choose to focus development elsewhere ("soft deprecate") or actively encourage users elsewhere ("@Deprecated") is an issue I think we can sever from this discussion.
>
> I'm surprised that your are confused.
>
> http://s.apache.org/h6685-veto
>
> Doug is very clear that he is vetoing the patch based on 2 reasons:
> a) dependency on PB
> b) extension to SequenceFile
>
> a) is technical, and we can debate about it.
>
> b) isn't. It's his 'vision' for the project, a vision which hasn't been ratified by the PMC.
>
> Arun
