Hadoop, mail # general - Large feature development

Re: Large feature development
Todd Lipcon 2012-09-01, 08:20
Thanks for starting this thread, Steve. I think your points below are
good. I've snipped most of your comment and will reply inline to one
bit below:

On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran wrote:

> Of the big changes that have worked, they are
>    1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
>    with incremental changes going on in trunk, RTC with lots of tests. This
>    isn't finished, and the test problem there is that functional testing of
>    all failure modes requires software-controlled fencing devices and switches
>    -and tests to generated the expected failure space.

Actually, most of the HDFS HA code has been done on branches. The
first work that led towards HA was the redesign of the edits logging
infrastructure -- HDFS-1073. This was a feature branch with about 60
patches on it. Then HDFS-1623, the main manual-failover HA
development, had close to 150 patches on the branch. Automatic HA
(HDFS-3042) was some 15-20 patches. The current work (removing
dependency on NAS) is around 35 patches in so far and getting close to

In these various branches, we've experimented with a few policies
which have differed from trunk. In particular:
- HDFS-1073 had a "modified review then commit" policy: if a patch
sat without a review for more than 24 hours, we committed it, on the
condition that it would get a post-commit review before the branch
was merged.
- All of the branches have done away with the requirement of running
the full QA suite, findbugs, etc. prior to commit. This means that the
branches at times have broken tests checked in, but also makes it
quicker to iterate on the new feature. Again, the assumption is that
these requirements are met before merge.
- In all cases there has been a design doc and some good design
discussion up front before substantial code was written. This made it
easier to forge ahead on the branch with good confidence that the
community was on-board with the idea.
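
To make the first two branch policies concrete, here is a rough sketch of
how they could be modelled. It is only an illustration under my own
assumptions: the names (Patch, can_commit, ready_to_merge, qa_passed) are
hypothetical and do not correspond to any Hadoop or JIRA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 24-hour window from the HDFS-1073 "modified review then commit" policy.
REVIEW_TIMEOUT = timedelta(hours=24)


@dataclass
class Patch:
    jira_id: str              # hypothetical subtask key
    posted_at: datetime       # when the patch was posted for review
    reviewed: bool = False    # True once someone has reviewed it
    committed: bool = False   # True once it is on the feature branch


def can_commit(patch: Patch, now: datetime) -> bool:
    """A patch may go onto the branch if it has been reviewed, or if it
    has sat without a review for more than 24 hours."""
    return patch.reviewed or (now - patch.posted_at) > REVIEW_TIMEOUT


def commit(patch: Patch, now: datetime) -> None:
    """Commit to the feature branch; an unreviewed patch stays flagged
    (reviewed=False) so it still needs a post-commit review."""
    if not can_commit(patch, now):
        raise ValueError(f"{patch.jira_id}: still inside the review window")
    patch.committed = True


def ready_to_merge(patches, qa_passed: bool) -> bool:
    """The branch can be proposed for merge only once every patch is
    committed and reviewed, and the full QA run has passed."""
    return qa_passed and all(p.committed and p.reviewed for p in patches)
```

The point of the sketch is that the relaxed rules apply only on the branch:
ready_to_merge still enforces the normal review and QA requirements before
anything reaches trunk.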

Given my experiences, I think all of the above are useful to follow.
It means development can happen quickly, but ensures that when the
merge is proposed, people feel like the quality meets our normal

>    2. YARN: Arun on his own branch, CTR, merge once mostly stable, and
>    completely replacing MRv1.

I'd actually contend that YARN was merged too early. I have yet to see
anyone running YARN in production, and it's holding up the "Stable"
moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
I'm seeing fewer issues with our customers running Hadoop HDFS 2
than with those running Hadoop 1-derived code.

> How then do we get (a) more dev projects working and integrated by the
> current committers, and (b) a process in which people who are not yet
> contributors/committers can develop non-trivial changes to the project in a
> way that it is done with the knowledge, support and mentorship of the rest
> of the community?

Here's one proposal, making use of git as an easy way to allow
non-committers to "commit" code while still tracking development in
the usual places:
- Upon anyone's request, we create a new "Version" tag in JIRA.
- The developers create an umbrella JIRA for the project, and file the
individual work items as subtasks (either up front, or as they are
developed if using a more iterative model)
- On the umbrella, they add a pointer to a git branch to be used as
the staging area for the project. As they develop each subtask, they
can use the JIRA to discuss the development as they would with a
normally committed JIRA, but when they feel it is ready to go (not
requiring a +1 from any committer) they commit to their git branch
instead of the SVN repo.
- When the branch is ready to merge, they can call a merge vote, which
requires +1 from 3 committers, same as a branch being proposed by an
existing committer. A committer would then use git-svn to merge their
branch commit-by-commit, or if it is less extensive, simply generate a
single big patch to commit into SVN.
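
As a small illustration of the merge-vote step, the check below counts
only +1 votes from committers toward the threshold of 3. The function
name, the vote format, and the example names are hypothetical; it models
just the threshold stated above and nothing else about how votes are run.

```python
# "requires +1 from 3 committers"
REQUIRED_COMMITTER_PLUS_ONES = 3


def merge_vote_passes(votes, committers):
    """votes: dict mapping voter name -> vote string ("+1", "0", "-1").
    committers: set of names whose votes count toward the threshold.
    Returns True when at least 3 committers have voted +1."""
    plus_ones = {
        voter
        for voter, vote in votes.items()
        if vote == "+1" and voter in committers
    }
    return len(plus_ones) >= REQUIRED_COMMITTER_PLUS_ONES


# Example with made-up names: eve is not a committer, so her +1 does not count.
committers = {"alice", "bob", "carol", "dave"}
votes = {"alice": "+1", "bob": "+1", "eve": "+1"}
print(merge_vote_passes(votes, committers))
```

The branch authors themselves need not be committers here; only the merge
vote does.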

My thinking is that this would provide a low-friction way for people
to collaborate with the community and develop in the open, without
having to work closely with any committer to review every individual

Another alternative, if people are reluctant to use git, would be to
add a "sandbox/" repository inside our SVN, and hand out commit bit to
branches inside there without any PMC vote. Anyone interested in
contributing could request a branch in the sandbox, and be granted
access as soon as they get an Apache SVN account.

Todd Lipcon
Software Engineer, Cloudera