Re: [VOTE] Direction for Hadoop development
On 12/07/2010 02:37 PM, Roy T. Fielding wrote:
> Good, these are technical reasons.  The benefits can be cleared by docs.
> By incompatible, I assume you mean forward-compatibility of old versions
> of Hadoop reading newer files.  Can we fix that by having the new
> implementation use the old file format by default until it is configured
> to use one of the new interfaces for writing?


> You keep referring to the kernel as if it were a product.  I don't see
> a kernel product in the list of things released by Apache Hadoop.

The line is fairly clear.  The kernel is the daemons plus the framework
code that invokes user code.  The set of pluggable user implementations
is fairly small: InputFormat, OutputFormat, Mapper, Reducer, RawComparator.

SequenceFile was originally part of the kernel but is now only used by
user-level InputFormats and OutputFormats.

> If there were such a product, then it would make sense for Apache Hadoop
> to also release ancillary products for common libraries, test frameworks,
> and modular storage interfaces.  Rearchitecting the Hadoop product suite
> into such a logical arrangement would make sense, and after such an
> architecture is put into place then "keeping the kernel simple" would
> be a reason to veto a change to the kernel.

Such a re-arrangement has been proposed but not completed.  Relevant
issues are MAPREDUCE-1638, MAPREDUCE-1453, and MAPREDUCE-1700.  It
mostly involves build issues; the architecture already largely supports
the distinction.

>> Tom long ago provided patches showing how the existing
>> configuration system can provide equivalent extension
>> implementations outside of the kernel with no incompatible changes.
>> (MAPREDUCE-376 and MAPREDUCE-377)
> They both seem to be active and unfinished.  If they are equivalent fixes
> to the same problem, then I suggest applying them to a branch, documenting
> how they work, and then agreeing to have a bake-off.  A bake-off is a
> decision made by performance and feature-completeness as an objective
> way to resolve an impasse due to mutually exclusive vetoes.  All sides agree
> to drop the veto and accept whichever performs best, by majority decision.

A bake-off could be a good way to resolve this.  Performance differences
would not likely be measurable, but folks might examine user programs
and consider compatibility and support implications and vote accordingly.

> All action items can be voted on.  What we are talking about here is a
> short term plan, and it is listed as a type of action item under
> changes to products.

Then voting on specific short-term actions might be a good way to
resolve this.

Some specific short-term questions we might vote on:

1. Should we add specific versions of Protocol Buffers and Thrift to the
classpath of every MapReduce program?

2. Should SequenceFile be forward-compatible, i.e., if an existing
program that stores Writables in a SequenceFile is run against the new
version, should the old version still be able to read the output of the
new version?

3. Should we continue to support a specified interchange format and/or data
model for configuration data, or should configurations rather be opaque
binary data?  An interchange format might be JSON.  An interchange data
model might be Map<String,Value>, where values can be strings, booleans,
numbers, bytes or nested configuration data, defined by a standard API
that all configurable items would support.  A specified format or model
would permit things like using -D to set configuration options and
permit generic interaction with external configuration systems.  With
opaque binary configurations, each configurable item would provide its
own API and would require specific new code that calls this API for each
parameter that could be set with -D or from an external configuration
system.
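
To make question 3 concrete, here is a minimal sketch of what a specified
data model could look like.  It is not Hadoop's Configuration class; the
class name, the method names, the example keys, and the choice to treat
dotted keys as nesting are all assumptions made for illustration.  It also
assumes the "-Dkey=value" argument form.  The point is only that, with a
generic model, one piece of code can apply -D options to any configurable
item.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a generic configuration data model (not a Hadoop API). */
public class ConfigModelSketch {

    /** Stores value under a dotted key such as "a.b.c", creating nested maps as needed. */
    @SuppressWarnings("unchecked")
    static void set(Map<String, Object> root, String dottedKey, Object value) {
        String[] parts = dottedKey.split("\\.");
        Map<String, Object> node = root;
        for (int i = 0; i < parts.length - 1; i++) {
            Object child = node.get(parts[i]);
            if (!(child instanceof Map)) {
                // Missing or scalar entry: replace it with a nested configuration.
                child = new LinkedHashMap<String, Object>();
                node.put(parts[i], child);
            }
            node = (Map<String, Object>) child;
        }
        node.put(parts[parts.length - 1], value);
    }

    /** Interprets raw text as a boolean or a number when possible, else keeps the string. */
    static Object parseScalar(String raw) {
        if (raw.equals("true") || raw.equals("false")) {
            return Boolean.valueOf(raw);
        }
        try {
            return Long.valueOf(raw);
        } catch (NumberFormatException ignored) {
            // Not an integer; try a floating-point number next.
        }
        try {
            return Double.valueOf(raw);
        } catch (NumberFormatException ignored) {
            // Not a number either; fall through to a plain string.
        }
        return raw;
    }

    /** Applies every "-Dkey=value" argument to the configuration; other arguments are skipped. */
    static Map<String, Object> applyDefines(Map<String, Object> root, String... args) {
        for (String arg : args) {
            if (!arg.startsWith("-D")) {
                continue;
            }
            int eq = arg.indexOf('=');
            if (eq < 3) {
                continue; // no '=' or an empty key
            }
            set(root, arg.substring(2, eq), parseScalar(arg.substring(eq + 1)));
        }
        return root;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = applyDefines(
            new LinkedHashMap<>(),
            "-Dexample.buffer.mb=200",   // hypothetical key
            "-Dexample.compress=true");  // hypothetical key
        System.out.println(conf);
    }
}
```

With opaque binary configurations, nothing like applyDefines could be
written once; each configurable item would need its own code to map a -D
parameter onto its own API.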

>>> They are also subject to veto if and only if they
>>> are to be applied to the current release branch (or a released branch).
>> Owen intends to merge this patch to a release branch.

So votes on action items would be simple majority if they're not
intended to be merged to a release branch, and vetoable if they are?  Is
that right?