Hadoop >> mail # general >> [DISCUSS] Spin out MR, HDFS and YARN as their own TLPs and disband Hadoop umbrella project


Re: [DISCUSS] Spin out MR, HDFS and YARN as their own TLPs and disband Hadoop umbrella project
On Wed, Aug 29, 2012 at 4:54 PM, Mattmann, Chris A (388J)
<[EMAIL PROTECTED]> wrote:
>
> Please provide examples that show umbrella projects work.

Hadoop, in its current form?

The code bases are tightly intertwined. We pulled out Pig/Hive/HBase
because they were substantial codebases that didn't share much code
with the rest, and thus could reasonably be expected to release
independently.

We could get HDFS and MR to that point, but we haven't yet, because
they rely so much on Common.

If we forked Common by copy-pasting it, we'd be doubling our
maintenance work on this shared code. We basically did this with the IPC code for
HBase, and then we had double the work to protobuf-ify both HBase and
HDFS/MR earlier this year. I know because I spent a bunch of hours on
both.

> I've been
> at this Foundation a lot longer than you have. I've seen them not work
> and have been involved in ones that don't work. See splits from Lucene,
> the same threads (with different names, different products, different software
> but the exact same issues). See your own splits from Hadoop cited elsethread.
> See the friggin' Apache board minutes discussing why umbrella projects
> are bad.
>
> I don't know what else to tell you. I'm not going to go look up all the threads.
> I'm not Google nor do I care to. All I can say is that I've seen it before and
> so have others. In your own project.
>

What's one concrete example of where it would be better if we split? I
can't think of any. We'd still have competing interests in HDFS, and
we'd still get in the same arguments.

To say that all ASF projects should work the same seems pretty bizarre
to me. The ASF provides license protection, infrastructure, and a set
of guidelines for what makes successful projects. But I don't think it
is the foundation's place to dictate what its projects should do "from
above" if the projects themselves do not see a problem.

If the project is so messed up, then maybe some folks should fork it
into the incubator like you've suggested? What's wrong with the
anarchic "let the best project succeed" philosophy, which I've also
heard from Apache?

> You still point to arguing to contention -- it's more than that Todd. The project's
> policies for inclusivity have nothing to do with arguing about technical issues.

I'm absolutely for meritocracy. I just have a high bar for what should
be considered "merit". Perhaps the PMC as a whole has a high bar. For
a system that stores my data, I'm pretty happy about that.

>
> Dude, you have to do that regardless, that has nothing to do with *Apache Hadoop*.
> Take your Cloudera hat off and put your *Apache Software Foundation* hat on. Is your
> #1 priority developing software here to stitch code back together, turn it into a deliverable
> for your customers (I'm guessing Cloudera customers, right? B/c Apache doesn't have
> specific customers?) and to maintain green Jenkins builds?

Yes? I think so? If we do a bad release and it loses substantial data,
our user base would disappear quite quickly.

>
> Also tell me how the 4 SVN commands I suggested will stop you from doing the above?
> At Apache?

If the projects are on separate release schedules, this means that
cross-project changes have to be staged across the projects in such a
way that neither project breaks in the interim. All of our internal
APIs become public APIs. We worked like this for around a year during
the "project split" period. It was super complicated and our builds
were often red, we wasted a lot of time, and new users couldn't figure
out how to contribute.
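
As a concrete illustration of that staging problem, here is a hypothetical sketch (the `IpcClient` class and its methods are made up for illustration, not actual Hadoop code): once the projects release independently, changing a shared internal API means the old signature has to survive at least one full release cycle alongside the new one, because already-shipped downstream releases still call it.

```java
// Hypothetical sketch of staging an internal-API change across
// independently released projects. None of these names are real
// Hadoop classes; they only illustrate the release choreography:
//
// Release N of "Common":   add the new overload, keep the old one,
//                          so already-shipped HDFS/MR still work.
// HDFS/MR built against N: migrate callers to the new overload.
// Release N+1 of "Common": only now can the old overload be removed.
public class IpcClient {

    /** Old signature: must stay until every downstream caller migrates. */
    @Deprecated
    public byte[] call(byte[] request) {
        return call(request, 0L); // 0 = no timeout, preserving old behavior
    }

    /** New signature, introduced alongside the old one in release N. */
    public byte[] call(byte[] request, long timeoutMillis) {
        // A real client would perform an RPC; this stub just echoes
        // the request so the sketch is runnable.
        return request;
    }
}
```

With a single coordinated release, the deprecated method and the migration window simply aren't needed; callers and callee change in one commit.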

In the absence of a reasonable *technical* strategy to release
independently, and a lot of work to stabilize internal APIs around
security and IPC in particular, doing it again would cause the same
problems it caused the first time.

It also makes the users' lives much more difficult, or forces them to
only consume via downstream packagers. Earlier in this thread, you
seemed to think that downstream packagers indicated an issue with the
community: fracturing the releases would only serve to make the ASF
download page even less useful for someone who just wants to get going
fast.

If the projects were on different release schedules, then we'd be more
likely to have to do a lot of local patching to get stuff to "fit
together" right. Version compatibility is a difficult problem -- it
multiplies the QA matrix, complicates deployment, etc. It's not
insurmountable, but unless there's something to be gained (what is it,
again, that you think we'd gain, specifically?) I don't see why we'd
take this additional hassle.
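
To put a rough number on how the matrix multiplies (a toy calculation with made-up counts, not Hadoop's actual support matrix): with independent release trains, every supported version of each project can meet every supported version of the others, so the configurations to test grow as a product rather than a single line of coordinated releases.

```java
// Toy arithmetic for the QA-matrix point. The project and version
// counts are illustrative assumptions, not real Hadoop numbers.
public class QaMatrix {

    /** Configurations to certify: versionsEach raised to projects. */
    static int combinations(int projects, int versionsEach) {
        int total = 1;
        for (int i = 0; i < projects; i++) {
            total *= versionsEach;
        }
        return total;
    }

    public static void main(String[] args) {
        // One umbrella project, three supported releases: 3 configurations.
        System.out.println(combinations(1, 3)); // prints 3
        // Common, HDFS, and MapReduce each on their own release train,
        // three supported versions apiece: 27 configurations.
        System.out.println(combinations(3, 3)); // prints 27
    }
}
```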

Thanks for that. As for Apache vs Cloudera hat: I think they're well
aligned here. Both hats want the project to be easy for people to
contribute to, and want to avoid a bunch of wasted time spent on new
technical issues that this would create. I want to spend that time
making the product better, for our users' benefit. Whether the users
are Apache community users, or Cloudera customers, or Facebook's data
scientists, they all are going to be happier if I spend a month
improving our HA support compared to spending a month figuring out how
to release three separate projects which somehow stitch together in a
reasonable way at runtime without jar conflicts, tons of duplicate
configuration work, byzantine version dependencies, etc.

-Todd
Todd Lipcon
Software Engineer, Cloudera