Large feature development
Steve Loughran 2012-08-31, 17:07
I'm going to split this out and raise it as a separate issue.

On 29 August 2012 19:35, Jun Ping Du <[EMAIL PROTECTED]> wrote:

> Hi Chris and all,
> Thanks for initiating the discussion. Can I say something in a
> prospective of contributor but not a committer or PMC member?
> First, I have a feeling that current hadoop project process is good for
> contributors to deliver a bug fix but not so easy to deliver a big feature.
> I have great experience in bug fixing work that can get quickly response
> from committers and checked in. However, I feel a little frustrated in
> delivering a feature (~5K LOC, very important for hadoop running well on
> virtualization infrastructure) across common, hdfs, map reduce and yarn.
> Firstly, you have to figure out different committers you should turn for
> help on each component, then convince them your ideas and work with them in
> reviewing and committing the code. Each committers should understand the
> completed story and learn the code pending on review as well as that
> already checked in. If some committers are super busy, then the feature
> looks like pending forever. Thus, due to my current experience, I may have
> to say this process is not so friendly to contributors who come from
> different organizations with different backgrounds but have the same wish
> to contribute more to Apache hadoop.

One of the problems here is that a 5KLOC patch is a major change -and regardless of whether you are a committer or not, you're going to hit a lot of inertia. My fairly large service lifecycle patch (https://issues.apache.org/jira/browse/HDFS-326) never survived, and I put a lot of effort in there as a committer. That was with something that I was visibly doing in a branch of apache SVN, merging and regression testing every week, syncing things, testing on my own infrastructure, etc.
Turning up with a large diff without any previous involvement in the project or collaborative development is going to hit a wall in pretty much every OSS project, the big issues not just being "why" and "what does it break", but "how is a patch this big going to be maintained?" and "how is it going to be tested on anything other than the specific platform it's been worked on". Any test plan that requires custom hardware, infrastructure &c is tricky. It's hard enough making the jump from the normal test suite to testing with real workloads on production-scale clusters; if you start needing specific CPU designs, GPUs, non-standard OS/JVM, etc, then it becomes impossible to regression test these for a release.

To make things worse, Hadoop is a critical piece of so many companies' infrastructure; Yahoo!, Facebook, Twitter, LinkedIn, &c. The value of the code is not the cost of implementation, it is the value of all the data stored in HDFS. This is why the barrier to entry of code is much, much lower in contrib/ than it is into the core -and the normal way to isolate work is to design another extension point into which these things can go, where people can be confident that changes won't break things, and where someone else takes on the costs of maintenance and testing their custom extensions.

> Based on this, for spinning out hadoop sub-project to TLPs, I would
> glad to see we will have concisely committer list for each projects then
> committers can be more focus (more bandwidth may be?) and contributors can
> know who they should turn to get quick response and help there. On the
> other hand, I would concern it may take more complexity to dependencies for
> features that across sub-project today as you should figure out branches
> for each TLP but it is hard to estimate when code can come alive in each
> branch of TLP (may take the similar complexity to committers as well).
> I don't have many good suggestions but would be glad to see the process
> can be more smoothly for contributor's work no matter what decision we are
> making today. Just 2 cents.

I do agree we need a better way of having larger activities that span more of the system being developed and then successfully committed. Some of the what-not-to-do & what-to-do has been hinted at in the bottom of Defining Hadoop ( http://wiki.apache.org/hadoop/Defining%20Hadoop ), but there's no formalisation of how to do more major works within the Hadoop codebase.

Of the big changes that have worked, they are

1. HDFS 2's HA and ongoing improvements: collaborative dev on the list with incremental changes going on in trunk, RTC with lots of tests. This isn't finished, and the test problem there is that functional testing of all failure modes requires software-controlled fencing devices and switches -and tests to generate the expected failure space.
2. YARN: Arun on his own branch, CTR, merge once mostly stable, and completely replacing MRv1.

How then do we get (a) more dev projects working and integrated by the current committers, and (b) a process in which people who are not yet contributors/committers can develop non-trivial changes to the project in a way that it is done with the knowledge, support and mentorship of the rest of the community?

This topic has arisen before -and never reached a good answer. How can we incubate new pieces of work in the project and mentor external contributions?

-steve
Re: Large feature development
Todd Lipcon 2012-09-01, 08:20
Thanks for starting this thread, Steve. I think your points below are good. I've snipped most of your comment and will reply inline to one bit below:
On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Of the big changes that have worked, they are
>
> 1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
> with incremental changes going on in trunk, RTC with lots of tests. This
> isn't finished, and the test problem there is that functional testing of
> all failure modes requires software-controlled fencing devices and switches
> -and tests to generate the expected failure space.
Actually, most of the HDFS HA code has been done on branches. The first work that led towards HA was the redesign of the edits logging infrastructure -- HDFS-1073. This was a feature branch with about 60 patches on it. Then HDFS-1623, the main manual-failover HA development, had close to 150 patches on the branch. Automatic HA (HDFS-3042) was some 15-20 patches. The current work (removing dependency on NAS) is around 35 patches in so far and getting close to merge.
In these various branches, we've experimented with a few policies which have differed from trunk. In particular:
- HDFS-1073 had a "modified review then commit" policy, which was that, if a patch sat without a review for more than 24hrs, we committed it with the restriction that there would be a post-commit review before the branch was merged.
- All of the branches have done away with the requirement of running the full QA suite, findbugs, etc prior to commit. This means that the branches at times have broken tests checked in, but also makes it quicker to iterate on the new feature. Again, the assumption is that these requirements are met before merge.
- In all cases there has been a design doc and some good design discussion up front before substantial code was written. This made it easier to forge ahead on the branch with good confidence that the community was on-board with the idea.
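The 24-hour rule described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the `Patch` class, its fields, and both function names are hypothetical and do not correspond to anything in Hadoop, JIRA, or SVN.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Hypothetical model of the branch policy; these names are illustrative only.
REVIEW_WINDOW = timedelta(hours=24)


@dataclass
class Patch:
    issue_id: str
    posted_at: datetime
    reviewed: bool = False                  # has anyone reviewed it yet?
    needs_post_commit_review: bool = False  # set when committed without review


def try_commit_to_branch(patch: Patch, now: datetime) -> bool:
    """Return True if the patch may be committed to the feature branch now."""
    if patch.reviewed:
        return True
    if now - patch.posted_at > REVIEW_WINDOW:
        # Unreviewed for more than 24 hours: commit anyway, but record that
        # a review is still owed before the branch can be merged.
        patch.needs_post_commit_review = True
        return True
    return False


def branch_ready_to_merge(committed: List[Patch]) -> bool:
    """A merge can be proposed only when no committed patch still awaits review."""
    return not any(p.needs_post_commit_review and not p.reviewed for p in committed)
```

The point of the design is that review is deferred rather than skipped: a patch can land on the branch early, but the debt is tracked and must be cleared before the merge.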
Given my experiences, I think all of the above are useful to follow. It means development can happen quickly, but ensures that when the merge is proposed, people feel like the quality meets our normal standards.
> 2. YARN: Arun on his own branch, CTR, merge once mostly stable, and
> completely replacing MRv1.
I'd actually contend that YARN was merged too early. I have yet to see anyone running YARN in production, and it's holding up the "Stable" moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and I'm seeing fewer issues in our customers running Hadoop HDFS 2 compared to Hadoop 1-derived code.
> How then do we get (a) more dev projects working and integrated by the
> current committers, and (b) a process in which people who are not yet
> contributors/committers can develop non-trivial changes to the project in a
> way that it is done with the knowledge, support and mentorship of the rest
> of the community?
Here's one proposal, making use of git as an easy way to allow non-committers to "commit" code while still tracking development in the usual places:
- Upon anyone's request, we create a new "Version" tag in JIRA.
- The developers create an umbrella JIRA for the project, and file the individual work items as subtasks (either up front, or as they are developed if using a more iterative model).
- On the umbrella, they add a pointer to a git branch to be used as the staging area for the branch. As they develop each subtask, they can use the JIRA to discuss the development like they would with a normally committed JIRA, but when they feel it is ready to go (not requiring a +1 from any committer) they commit to their git branch instead of the SVN repo.
- When the branch is ready to merge, they can call a merge vote, which requires +1 from 3 committers, same as a branch being proposed by an existing committer. A committer would then use git-svn to merge their branch commit-by-commit, or if it is less extensive, simply generate a single big patch to commit into SVN.
My thinking is that this would provide a low-friction way for people to collaborate with the community and develop in the open, without having to work closely with any committer to review every individual subtask.
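The merge-vote step of the proposal can be sketched as follows. This is a minimal Python illustration, not an existing tool: the function name and the vote representation are assumptions, and it deliberately models only the "+1 from 3 committers" threshold stated above, leaving out anything the proposal does not spell out (such as how -1 votes are handled).

```python
from typing import Dict, Set

REQUIRED_PLUS_ONES = 3  # "+1 from 3 committers", as stated in the proposal


def merge_vote_passes(votes: Dict[str, int], committers: Set[str]) -> bool:
    """Count +1 votes cast by committers; other votes are not counted here.

    `votes` maps a voter's name to +1, 0 or -1 (a hypothetical representation).
    """
    plus_ones = sum(
        1 for voter, vote in votes.items()
        if voter in committers and vote == 1
    )
    return plus_ones >= REQUIRED_PLUS_ONES
```

For example, with committers `{"alice", "bob", "carol"}`, a branch with +1 from all three passes, while +1 from two committers and one non-committer does not.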
Another alternative, if people are reluctant to use git, would be to add a "sandbox/" repository inside our SVN, and hand out commit bit to branches inside there without any PMC vote. Anyone interested in contributing could request a branch in the sandbox, and be granted access as soon as they get an apache SVN account.
-Todd Todd Lipcon Software Engineer, Cloudera
Re: Large feature development
Steve Loughran 2012-09-02, 14:58
On 1 September 2012 09:20, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> Thanks for starting this thread, Steve. I think your points below are
> good. I've snipped most of your comment and will reply inline to one
> bit below:
>
> On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
> <[EMAIL PROTECTED]> wrote:
>
> > How then do we get (a) more dev projects working and integrated by the
> > current committers, and (b) a process in which people who are not yet
> > contributors/committers can develop non-trivial changes to the project in a
> > way that it is done with the knowledge, support and mentorship of the rest
> > of the community?

Both HDFS2 and MRv2 are in trunk, therefore I consider them successes.

> Here's one proposal, making use of git as an easy way to allow
> non-committers to "commit" code while still tracking development in
> the usual places:

This is effectively what people do. I'm less worried about the code side of things than the integration and mentoring.

> - Upon anyone's request, we create a new "Version" tag in JIRA.

-1. There are enough versions. There is a "tag" field in JIRA for precisely this purpose.

> - The developers create an umbrella JIRA for the project, and file the
> individual work items as subtasks (either up front, or as they are
> developed if using a more iterative model)

as today

> - On the umbrella, they add a pointer to a git branch to be used as
> the staging area for the branch. As they develop each subtask, they
> can use the JIRA to discuss the development like they would with a
> normally committed JIRA, but when they feel it is ready to go (not
> requiring a +1 from any committer) they commit to their git branch
> instead of the SVN repo.

some integration w/ jenkins and pull testing would be good here

> - When the branch is ready to merge, they can call a merge vote, which
> requires +1 from 3 committers, same as a branch being proposed by an
> existing committer. A committer would then use git-svn to merge their
> branch commit-by-commit, or if it is less extensive, simply generate a
> single big patch to commit into SVN.
>
> My thinking is that this would provide a low-friction way for people
> to collaborate with the community and develop in the open, without
> having to work closely with any committer to review every individual
> subtask.
>
> Another alternative, if people are reluctant to use git, would be to
> add a "sandbox/" repository inside our SVN, and hand out commit bit to
> branches inside there without any PMC vote. Anyone interested in
> contributing could request a branch in the sandbox, and be granted
> access as soon as they get an apache SVN account.

I don't see the technical issues with how the merge is done as the main problem.
The barriers to getting your stuff in are
1. getting people to care enough to help develop the feature -mentorship, collaborative development.
2. getting incremental parts in to avoid the continual merge-regression-test hell that you go through if you are trying to keep a separate branch alive. It's not the technical aspects of the merge so much as the need to run all the hadoop tests and your own test suite, and track down whether a failure is a regression in -trunk or something in your code.
Jun's patch is an example of this situation. We haven't seen the effort he and his colleagues have done with merge and test, but I'm confident it's been there. What they now have is a "big bang" class of patch which is so big that anyone reviewing it would have to spend a couple of weeks going through the codebase trying to understand it. Which as we all know means two weeks not doing all the things you are committed to doing.
We know it's there, we know it's current -so how to use this as an exercise in something to pull in incrementally?
-Steve
Re: Large feature development
Eli Collins 2012-09-02, 19:47
On Sun, Sep 2, 2012 at 7:58 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Jun's patch is an example of this situation. We haven't seen the effort he
> and his colleagues have done with merge and test, but I'm confident it's
> been there. What they now have is a "big bang" class of patch which is so
> big that anyone reviewing it would have to spend a couple of weeks going
> through the codebase trying to understand it. Which as we all know means
> two weeks not doing all the things you are committed to doing.
>
> We know it's there, we know it's current -so how to use this as an exercise
> in something to pull in incrementally?
Jun's patches from HADOOP-8468 (which were developed on a private github repo) are being pulled incrementally into trunk. There's no feature branch, which I think would have been a better route, but at least the current approach has not prevented some progress.
All the recent examples of features that I can think of that have been developed upstream first at Apache on feature branches have gone well.
Thanks, Eli
Re: Large feature development
Arun C Murthy 2012-09-01, 19:47
Todd,
On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:
> I'd actually contend that YARN was merged too early. I have yet to see
> anyone running YARN in production, and it's holding up the "Stable"
> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
> I'm seeing fewer issues in our customers running Hadoop HDFS 2
> compared to Hadoop 1-derived code.
You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.
I'm pretty sure you realize this (we've talked about this privately), yet, for other users who might not be aware:
# YARN has been deployed for 6 months now at Yahoo on what is, by almost everyone's standards, a very LARGE ~450 node cluster.
# The entire YARN & MapReduce developer community has done an enormous amount of testing, compatibility work and performance work for many months now. It's been clear that YARN/MRv2 is superior to MR1 on every dimension - performance (2x in several cases), scale etc.; all dimensions which are critical for Hadoop's success in the past and future.
# Not just MR, this work has been done across the stack - Pig, Oozie, HCatalog etc. This has been an enormous amount of work not just by YARN/MRv2, but by all these communities.
# Many thousands of unique end-user applications at Yahoo have *certified* YARN/MRv2. That is pretty much *all* MapReduce, Pig etc. applications at Yahoo - the most advanced Hadoop deployment in the world.
# It is now *days* away from being deployed on one of the largest and most demanding Hadoop clusters in the world with several *thousand* nodes and millions of applications per month. See Bobby's note if you don't believe me.
Notice, I didn't talk about any of the other benefits of YARN such as other frameworks to MR etc. - you'll see more of this such as real-time applications on Hadoop clusters over the next many months. For e.g. see discussions on Storm/S4 lists about YARN prototypes at various stages of availability.
Paying you back with the same coin, after being declared *done*, HDFS2 had several BASIC issues such as a non-working upgrade from hadoop-1 (HDFS-3731, HDFS-3579) or edit-log corruption (HDFS-3626). Maybe you or the customers you talk about don't care about it, whatever. For e.g. is the QJM work part of stable HDFS2? It's not even code complete yet.
IAC, it's pretty obvious we have different standards for declaring HDFS stable v/s YARN/MRv2 as stable. The standards I'm used to, being around since the dawn of this project, are what I use to measure stability, i.e. deployed and stable for weeks/months on some of the largest Hadoop clusters in the world before letting it loose on other 'customers'.
Given that upgrade-failures or data-corruption is acceptable, is YARN 'stable'? By the same standards - YES! - for many months now, much before HDFS HA was even code complete!
I don't want to engage in a debate on this further or expect you to care about YARN/MRv2, but please, for heavens' sake, do not publicly diss the work so many people have done for many, many months now or accuse them of *holding up Hadoop* - it's very poor form.
I'm very proud to have contributed to this effort, even more to have worked with such a talented and dedicated bunch. An acknowledgement would be nice, but the least I/we *do* expect is absence of public sniping by other members of the Hadoop community.
respectfully, Arun
Re: Large feature development
Eli Collins 2012-09-02, 20:00
On Sat, Sep 1, 2012 at 12:47 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> Todd,
>
> On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:
>
>> I'd actually contend that YARN was merged too early. I have yet to see
>> anyone running YARN in production, and it's holding up the "Stable"
>> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
>> I'm seeing fewer issues in our customers running Hadoop HDFS 2
>> compared to Hadoop 1-derived code.
>
> You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.
Todd is just saying that:
1. HDFS v2 has fewer critical bugs than v1 (mostly thanks to the edit log rewrite, which aside from HA was motivated by all the quality issues the v1 code has had)
2. HDFS is more mature than YARN. Not a surprise given that we all agree YARN is alpha, and a much newer project than HDFS that hasn't yet been deployed in production environments (to my knowledge).
I don't read this as a snipe against anyone coding on Hadoop, it's just that the two sub-projects are at different stages in their life and development.
Thanks, Eli
Re: Large feature development
Arun Murthy 2012-09-02, 22:11
Eli,
On Sep 2, 2012, at 1:01 PM, Eli Collins <[EMAIL PROTECTED]> wrote:
> On Sat, Sep 1, 2012 at 12:47 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>> Todd,
>>
>> On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:
>>
>>> I'd actually contend that YARN was merged too early. I have yet to see
>>> anyone running YARN in production, and it's holding up the "Stable"
>>> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
>>> I'm seeing fewer issues in our customers running Hadoop HDFS 2
>>> compared to Hadoop 1-derived code.
>>
>> You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.
>
> 2. HDFS is more mature than YARN. Not a surprise given that we all
> agree YARN is alpha, and a much newer project than HDFS that hasn't
> yet been deployed in production environments yet (to my knowledge).
Let's focus on the ground reality here.
Please read my (or Rajiv's) message again about YARN's current stability, how much it's baked, and its deployment plans to a very large cluster in a few *days*. Or, talk to the people developing, testing and supporting these customers and clusters.
I'll repeat - YARN has clearly baked much more than HDFS HA given the basic bugs (upgrade, edit logs corruption etc.) we've seen after being declared *done*; but then we just disagree since clearly I'm more conservative. Also, we need to be more conservative wrt HDFS - but then what would I know...
I'll admit it's hard to discuss with someone (or a collective) who just repeat themselves. Plus, I broke my own rule about email this weekend - so, I'll try harder.
Arun
Re: Large feature development
Todd Lipcon 2012-09-03, 01:12
Hey Arun,
First, let me apologize if my email came off as a personal "snipe" against the project or anyone working on it. I know the team has been hard at work for multiple years now on the project, and I certainly don't mean to denigrate the work anyone has done. I also agree that the improvements made possible by YARN are tremendously important, and I've expressed this opinion both online and in interviews with analysts, etc.
But, I'll stand by my point that YARN is at this point more "alpha" than HDFS2. You brought up two bugs in the HDFS2 code base as examples of HDFS 2 not being high quality. The first, HDFS-3626, was indeed a messy bug, but had nothing to do with HA, the edit log rewrite, or any other of the changes being discussed in the thread. In fact, the bug has been there since the "beginning of time", and is in fact present in Hadoop 1.0.x as well (which is why the JIRA is still open). You simply need to pass a non-canonicalized path by the Path(URI) constructor, and you'll see the same behavior in every release including 1.0.x, 0.20.x, or earlier. The reason it shows up more often in Hadoop 2 was actually due to the FsShell rewrite -- not any changes in HDFS itself, and certainly not related to HA like you've implied here.
The other bug causes blocksBeingWritten to disappear upon upgrade. This, also, had nothing to do with any of the features being discussed in this thread, and in fact only impacts a cluster which is taken down _uncleanly_ prior to an upgrade. Upon starting the upgraded cluster, the user would be alerted to the missing blocks and could rollback with no lost data. So, while it should be fixed (and has been), I wouldn't consider it particularly frightening. Most users I am aware of do a "clean" shutdown of services like HBase before trying to upgrade their cluster, and, worst case, they would see the issue immediately after the upgrade and perform a rollback with no adverse effects.
In branch-1, however, I've seen other bugs that I'd consider much more scary. Two in particular come to mind and together represent the vast majority of cases in which we've seen customers experience data corruption: HDFS-3652 and HDFS-2305. These two bugs were branch-1 only, and never present in Hadoop 2 due to the "edit log rewrite" project (HDFS-1073).
So, at risk of this thread just becoming a laundry list of bugs that have existed in HDFS, or a list of bugs in YARN, I'll summarize: I still think that YARN is "alpha" and HDFS 2 is at least as "stable" as Hadoop 1.0. We have customers running it for production workloads, in multi-rack clusters, with great success. But this has nothing to do with this thread at hand, so I'll raise the question of alpha/beta/stable labeling in the context of our next release vote, and hope we can go back to the more fruitful discussion of how to encourage large feature development while maintaining stability.
Thanks -Todd
Todd Lipcon Software Engineer, Cloudera
Re: Large feature development
Arun C Murthy 2012-09-03, 07:05
Todd,

On Sep 2, 2012, at 6:12 PM, Todd Lipcon wrote:

> First, let me apologize if my email came off as a personal "snipe"
> against the project or anyone working on it. I know the team has been
> hard at work for multiple years now on the project, and I certainly
> don't mean to denigrate the work anyone has done.
>
> But, I'll stand by my point that YARN is at this point more "alpha"
> than HDFS2.

It's unfair to tag-team me while consistently ignoring what I write. (We are also in danger of hitting the threefold repetition rule: http://en.wikipedia.org/wiki/Threefold_repetition. *smile*)

Anyway, I'll repeat, here are the facts on the ground - the work we've done testing/stabilizing YARN/MRv2, its stability, user-certification across thousands of unique apps, deployment etc. etc.: http://s.apache.org/QVX

> You brought up two bugs in the HDFS2 code base as examples
> of HDFS 2 not being high quality.

Through a lot of words you just agreed with what I said - if people didn't upgrade to HDFS2 (not just HA) they wouldn't hit any of these: HDFS-3626, HDFS-3731 etc.

There are more, for e.g. how do folks work around Secondary NN not starting up on upgrades from hadoop-1 (HDFS-3597)? They just copy multiple PBs over to a new hadoop-2 cluster, or patch SNN themselves post HDFS-1073?

Anyway, I agree, we should talk about this in context of an actual release - hadoop-2.1.0 should mark YARN as *beta* IMO - particularly since it will be deployed at scale.

Arun
Re: Large feature development
Todd Lipcon 2012-09-03, 07:31
On Mon, Sep 3, 2012 at 12:05 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>> But, I'll stand by my point that YARN is at this point more "alpha"
>> than HDFS2.
>
> It's unfair to tag-team me while consistently ignoring what I write.
I'm not sure I ignored what you wrote. I understand that Yahoo is deploying soon on one of their clusters. That's great news. My original point was about the state of YARN when it was merged, and the comment about its current state was more of an aside. Hardly worth debating further. Best of luck with the deployment next week - I look forward to reading about how it goes on the list.
>> You brought up two bugs in the HDFS2 code base as examples
>> of HDFS 2 not being high quality.
>
> Through a lot of words you just agreed with what I said - if people didn't upgrade to HDFS2 (not just HA) they wouldn't hit any of these: HDFS-3626,
You could hit this on Hadoop 1, it was just harder to hit.
> HDFS-3731 etc.
The details of this bug have to do with the upgrade/snapshot behavior of the blocksBeingWritten directory which was added in branch-1. In fact, the same basic bug continues to exist in branch-1. If you perform an upgrade, it doesn't hard-link the blocks into the new "current" directory. Hence, if the upgraded cluster exits safe mode (causing lease recovery of those blocks), and then the user issues a rollback, the blocks will have been deleted from the pre-upgrade image. This broken branch-1 behavior carried over into branch-2 as well, but it's not a new bug, as I said before.
> There are more, for e.g. how do folks work around Secondary NN not starting up on upgrades from hadoop-1 (HDFS-3597)? They just copy multiple PBs over to a new hadoop-2 cluster, or patch SNN themselves post HDFS-1073?
No, they rm -Rf the contents of the 2NN directory, which is completely safe and doesn't cause data loss in any way. In fact, the bug fix is exactly that -- it just does the rm -Rf itself, automatically. It's a trivial workaround similar to how other bugs in the Hadoop 1 branch have required workarounds in the past. Certainly no data movement or local patching. The SNN is transient state and can always be cleared.
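To make "rm -Rf the contents" concrete, here is a minimal Python sketch of clearing a directory while keeping the directory itself. It is only an illustration under stated assumptions, not the HDFS-3597 patch: the function name `clear_directory_contents`, the directory name `namesecondary`, and the file names are placeholders chosen for the example, and the demo runs against a throwaway temp directory rather than a real 2NN checkpoint directory.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory_contents(directory: Path) -> int:
    """Delete everything inside `directory` but keep the directory itself.

    Returns the number of top-level entries removed.
    (Hypothetical helper for illustration; not part of Hadoop.)
    """
    removed = 0
    # Snapshot the listing first so we never iterate while deleting.
    for entry in list(directory.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # recursive delete of a subdirectory
        else:
            entry.unlink()  # plain file or symlink
        removed += 1
    return removed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory, never on a real checkpoint dir.
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint_dir = Path(tmp) / "namesecondary"  # placeholder name
        (checkpoint_dir / "current").mkdir(parents=True)
        (checkpoint_dir / "current" / "fsimage").write_text("stale")
        (checkpoint_dir / "in_use.lock").write_text("")
        count = clear_directory_contents(checkpoint_dir)
        print(f"removed {count} entries; directory still exists: {checkpoint_dir.exists()}")
```

The directory itself is kept on purpose, matching the description above of removing only its contents.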
If you have any questions about other bugs in the 2.x line, feel free to ask on the relevant JIRAs. I'm still perfectly confident in the stability of HDFS 2 vs HDFS 1. In fact my cell phone is likely the one that would ring if any of these production HDFS 2 clusters had an issue, and I'll offer the same publicly to anyone on this list. If you experience a corruption or data loss issue on the tip of branch-2 HDFS, email me off-list and I'll personally diagnose the issue. I would not make that same offer for branch-1 due to the fundamentally less robust design which has caused a lot of subtle bugs over the past several years.
Thanks
-Todd
--
Todd Lipcon
Software Engineer, Cloudera
-
Re: Large feature development
Arun C Murthy 2012-09-03, 07:48
On Sep 3, 2012, at 12:31 AM, Todd Lipcon wrote:
> On Mon, Sep 3, 2012 at 12:05 AM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>> But, I'll stand by my point that YARN is at this point more "alpha"
>>> than HDFS2.
>>
>> I'll unfair to tag-team me while consistently ignoring what I write.
>
> I'm not sure I ignored what you wrote. I understand that Yahoo is
> deploying soon on one of their clusters. That's great news. My
> original point was about the state of YARN when it was merged, and the
> comment about its current state was more of an aside. Hardly worth
> debating further. Best of luck with the deployment next week - I look
> forward to reading about how it goes on the list.
Everyone +1'ed the merge, now we'd like to rewrite history? Also, its current state is much more than what you trivialized as 'deployed to one cluster' - again, please read my email on the effort we've undertaken to get where we are. That's a lot of work by many tens of people - hardly good form to trivialize them as you did.
Arun
-
Re: Large feature development
Arun C Murthy 2012-09-03, 07:22
On Sep 3, 2012, at 12:05 AM, Arun C Murthy wrote:
> Todd,
>
> I'll unfair to tag-team me while consistently ignoring what I write.
Ugh, late Sunday night school-boy error - should have read:
I'll point out it's unfair [...]
Arun
-
Re: Large feature development
Rajiv Chittajallu 2012-09-01, 21:29
It's unfortunate that certain work, a year after being accepted into the mainline, is being attributed to a single person. There is a significant amount of work done by people who are not on the PMC or committers, especially to get it running in production. For those who have been associated with running hadoop since before it became synonymous with 'BigData', stabilizing a major release takes time. With more critical systems dependent on hadoop, transitioning to a new feature set will take longer. hadoop-0.20 took ~8 months. IMHO, months after a feature set is accepted into the mainline, it may not be appropriate to question its quality.
In the next couple of months, we are planning to widely deploy the 0.23.3 release by Bobby. As with any major release, I know this is not going to be a smooth ride.
-rajive
----- Original Message -----
> From: Todd Lipcon <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Saturday, September 1, 2012 1:20 AM
> Subject: Re: Large feature development
>
> Thanks for starting this thread, Steve. I think your points below are
> good. I've snipped most of your comment and will reply inline to one
> bit below:
>
> On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
> <[EMAIL PROTECTED]> wrote:
>
>> Of the big changes that have worked, they are
>>
>> 1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
>> with incremental changes going on in trunk, RTC with lots of tests. This
>> isn't finished, and the test problem there is that functional testing of
>> all failure modes requires software-controlled fencing devices and switches
>> -and tests to generate the expected failure space.
>
> Actually, most of the HDFS HA code has been done on branches. The
> first work that led towards HA was the redesign of the edits logging
> infrastructure -- HDFS-1073. This was a feature branch with about 60
> patches on it. Then HDFS-1623, the main manual-failover HA
> development, had close to 150 patches on the branch. Automatic HA
> (HDFS-3042) was some 15-20 patches. The current work (removing
> dependency on NAS) is around 35 patches in so far and getting close to
> merge.
>
> In these various branches, we've experimented with a few policies
> which have differed from trunk. In particular:
> - HDFS-1073 had a "modified review then commit" policy, which was
> that, if a patch sat without a review for more than 24hrs, we
> committed it with the restriction that there would be a post-commit
> review before the branch was merged.
> - All of the branches have done away with the requirement of running
> the full QA suite, findbugs, etc prior to commit. This means that the
> branches at times have broken tests checked in, but also makes it
> quicker to iterate on the new feature. Again, the assumption is that
> these requirements are met before merge.
> - In all cases there has been a design doc and some good design
> discussion up front before substantial code was written. This made it
> easier to forge ahead on the branch with good confidence that the
> community was on-board with the idea.
>
> Given my experiences, I think all of the above are useful to follow.
> It means development can happen quickly, but ensures that when the
> merge is proposed, people feel like the quality meets our normal
> standards.
>
>> 2. YARN: Arun on his own branch, CTR, merge once mostly stable, and
>> completely replacing MRv1.
>
> I'd actually contend that YARN was merged too early. I have yet to see
> anyone running YARN in production, and it's holding up the "Stable"
> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
> I'm seeing fewer issues in our customers running Hadoop HDFS 2
> compared to Hadoop 1-derived code.
>
>> How then do we get (a) more dev projects working and integrated by the
>> current committers, and (b) a process in which people who are not yet
>> contributors/committers can develop non-trivial changes to the project in a
-
Re: Large feature development
Arun Murthy 2012-09-01, 22:33
Rajiv,
I'm pretty sure you mean '*blame* for certain work, [...], being attributed' ... :)
I certainly find blame for failures much more palatable than credit for collective successes.
IAC, thanks for chiming in, Hadoop will be better with you being more present at the forefront.
Arun
On Sep 1, 2012, at 2:30 PM, Rajiv Chittajallu <[EMAIL PROTECTED]> wrote:
> It's unfortunate that certain work, a year after being accepted into the mainline, is being attributed to a single person. There is a significant amount of work done by people who are not on the PMC or committers, especially to get it running in production. For those who have been associated with running hadoop since before it became synonymous with 'BigData', stabilizing a major release takes time. With more critical systems dependent on hadoop, transitioning to a new feature set will take longer. hadoop-0.20 took ~8 months.
>
> IMHO, months after a feature set is accepted into the mainline, it may not be appropriate to question its quality.
>
> In the next couple of months, we are planning to widely deploy the 0.23.3 release by Bobby. As with any major release, I know this is not going to be a smooth ride.
>
> -rajive
>
> ----- Original Message -----
>> From: Todd Lipcon <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Cc:
>> Sent: Saturday, September 1, 2012 1:20 AM
>> Subject: Re: Large feature development
>>
>> Thanks for starting this thread, Steve. I think your points below are
>> good. I've snipped most of your comment and will reply inline to one
>> bit below:
>>
>> On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Of the big changes that have worked, they are
>>>
>>> 1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
>>> with incremental changes going on in trunk, RTC with lots of tests. This
>>> isn't finished, and the test problem there is that functional testing of
>>> all failure modes requires software-controlled fencing devices and switches
>>> -and tests to generate the expected failure space.
>>
>> Actually, most of the HDFS HA code has been done on branches. The
>> first work that led towards HA was the redesign of the edits logging
>> infrastructure -- HDFS-1073. This was a feature branch with about 60
>> patches on it. Then HDFS-1623, the main manual-failover HA
>> development, had close to 150 patches on the branch. Automatic HA
>> (HDFS-3042) was some 15-20 patches. The current work (removing
>> dependency on NAS) is around 35 patches in so far and getting close to
>> merge.
>>
>> In these various branches, we've experimented with a few policies
>> which have differed from trunk. In particular:
>> - HDFS-1073 had a "modified review then commit" policy, which was
>> that, if a patch sat without a review for more than 24hrs, we
>> committed it with the restriction that there would be a post-commit
>> review before the branch was merged.
>> - All of the branches have done away with the requirement of running
>> the full QA suite, findbugs, etc prior to commit. This means that the
>> branches at times have broken tests checked in, but also makes it
>> quicker to iterate on the new feature. Again, the assumption is that
>> these requirements are met before merge.
>> - In all cases there has been a design doc and some good design
>> discussion up front before substantial code was written. This made it
>> easier to forge ahead on the branch with good confidence that the
>> community was on-board with the idea.
>>
>> Given my experiences, I think all of the above are useful to follow.
>> It means development can happen quickly, but ensures that when the
>> merge is proposed, people feel like the quality meets our normal
>> standards.
>>
>>> 2. YARN: Arun on his own branch, CTR, merge once mostly stable, and
>>> completely replacing MRv1.
>>
>> I'd actually contend that YARN was merged too early. I have yet to see
>> anyone running YARN in production, and it's holding up the