Hadoop, mail # general - [DISCUSS] Hadoop Security Release off Yahoo! patchset

Re: [DISCUSS] Hadoop Security Release off Yahoo! patchset
Eric Baldeschwieler 2011-01-14, 18:25
Hi Ian,

Thanks for holding off on that last .5. I've been working on a big email giving more context on this. Let me preview some issues.

Our goal with this branch is twofold: 1) get the code out in a branch quickly so we can collaborate on it with the community. 2) not change the character of the code. See testing below. We're happy to compromise on any other dimension, as long as we can do 1 & 2 above.

1) I agree this is not a good precedent. We don't support mega-patches in general. We are doing this as part of discontinuing the "yahoo distribution of Hadoop". We don't plan to continue doing 30 person-year projects outside apache and then merging them in!!

2) append is hard. It is so hard we rewrote the entire write pipeline (5 person-years of work) in trunk after giving up on the codeline you are suggesting we merge in. That work is what distinguishes all post-20 releases from 20 releases in my mind. I don't trust the 20 append code line. We've been hurt badly by it. We did the rewrite only after losing a bunch of production data a bunch of times with the previous code line. I think the various 20 append patch lines may be fine for specialized hbase clusters, but they don't have the rigor behind them to bet your business on them.

3) I think having a very stable recent codeline available for teams coming into Hadoop who want to run big business apps and contribute code back is very helpful. I've been talking to folks in other orgs and they've expressed a huge amount of interest in this work, but begged us to put it into apache, so their oversight bodies will let them use it.

4) we're happy to incorporate ideas into how to best merge the work into trunk. Let's find the most cost effective way to preserve the most devel data possible.

5) testing. Ian, I think you do us a disservice when you talk about us just testing in our environments. If you look at the history of the project, we've been the force behind every stable release of apache Hadoop. And all the non-apache Hadoop releases have been tracking this patch set. We fully support the community developing independent testing capabilities. We plan to contribute to that effort. But we are the organization with far and away the best record for testing Hadoop.

We are proud of this release, we want to share it. Help us sort out how.


E14 - via iPhone

On Jan 14, 2011, at 6:15 AM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:

> (with my Apache hat on)
> I'm -0.5 on doing this as one big mega-patch and not including append (as opposed to a series of smaller patches).
> for the following reasons:
> 1. It encourages bad behavior. We want discussion (and development) to happen on the lists, not in some office. By allowing these large code-dumps it condones this behavior, and we will likely see it again and again. Like it or not, this is not the apache model of open source governance.
> 2. There is a risk that some code that is not in a JIRA or separate patch creeps in unwittingly. This isn't a major deal per se, but we don't really have the proper paper trail, or the documentation on what bug it fixed etc etc.
> 3. Other groups (Facebook for example) are running with their own set of patches. They currently have the luxury of examining each individual patch to decide if they want to integrate it (and test it) in their environment. We are forcing them to do the work of finding the bits they want in this huge patch.
> 4. By not including the append patch, we are making this release unusable for a large portion of our community who run hbase.
> 5. It makes it very hard to test. While it makes me comfortable that it has gone through Yahoo!'s QA and is running in their environments, it doesn't mean that it will work in other organizations who have different workload mixes and software running on them. With one huge patch it makes it all or nothing.. either they take the code-drop and perform a large QA-integration effort, or they forgo the whole patch altogether.