For the past couple of releases of the Hadoop 2.X code line,
integration between Hadoop and its downstream projects has
become quite a thorny issue. The poster child here is Oozie, where
every release of Hadoop 2.X seems to break compatibility
in various unpredictable ways. At times other components (such
as HBase) also seem to be affected.
Now, to be extremely clear -- I'm NOT talking about the *latest* version
of Oozie working with the *latest* version of Hadoop. Instead,
my observations come from running previous *stable* releases
of Bigtop on top of Hadoop 2.X RCs.
As many of you know, Apache Bigtop aims at providing a single
platform for integration of Hadoop and Hadoop ecosystem projects.
As such we're uniquely positioned to track compatibility between
different Hadoop releases with regard to the downstream components
(things like Oozie, Pig, Hive, Mahout, etc.). For every single RC
we've been pretty diligent at trying to provide integration-level feedback
on the quality of the upcoming release, but it seems that our efforts
don't quite suffice to stabilize Hadoop 2.X.
Of course, one could argue that while the Hadoop 2.X code line was
designated 'alpha', perfect integration and compatibility were
NOT what the Hadoop community was focusing on. I can appreciate
that view, but what I'm interested in is the future of Hadoop 2.X,
not its past. Hence, here's my question to all of you as the
Hadoop community at large:
Do you guys think that the project has reached a point where integration
and compatibility issues should be prioritized really high on the list
of things that make or break each future release?
The good news is that Bigtop's charter is in large part *exactly* about
providing you with this kind of feedback. We can easily tell you when
Hadoop behavior, with regard to downstream components, changes
between a previous stable release and the new RC (or even branch/trunk).
What we can NOT do is submit patches for all the issues. We are simply
too small a project and we need your help with that.
I truly believe that we owe it to the downstream projects, and in the
second half of this email I will try to convince you of that.
We all know that integration projects are impossible to pull off
unless there's a general consensus between all of the projects involved
that they indeed need to work with each other. You can NOT force
that notion, but you can always try to influence. This relationship
goes both ways.
Consider the question facing the downstream communities:
whether or not to adopt Hadoop 2.X as their basis. To answer
that question each downstream project has to be reasonably
sure that its concerns will NOT fall on deaf ears and that
Hadoop developers are, essentially, 'ready' for it to pick
up Hadoop 2.X. I would argue that so far the Hadoop community
has gone out of its way to signal that the 2.X code line is NOT
ready for the downstream.
I would argue that moving forward this is a really unfortunate
situation that may end up undermining the long term success
of Hadoop 2.X if we don't start addressing the problem. Think
about it -- 90% of unit tests that run downstream on Apache
infrastructure are still exercising Hadoop 1.X underneath.
In fact, if you were to forcefully make, let's say, HBase's
unit tests run on top of Hadoop 2.X, quite a few of them
would fail. The Hadoop community is, in effect, cutting
itself off from its biggest source of feedback -- its downstream
users. This in turn:
* leaves the Hadoop project in a perpetually broken state
* leaves Apache Hadoop 2.X releases in a state considerably
inferior to the releases *including* Apache Hadoop done by the
vendors. The users have no choice but to align themselves
with vendor offerings if they wish to utilize the latest Hadoop functionality.
The artifact that is known as Apache Hadoop 2.X has stopped being
a viable choice, thus fracturing the user community and reducing
the benefits of a commonly deployed codebase.
* leaves downstream projects of Hadoop in a jaded state where
they legitimately get very discouraged and frustrated and eventually
give up thinking that -- well, we work with one release of Hadoop
(the stable one, Hadoop 1.X) and we shall wait for the Hadoop
community to get their act together.
In my view (shared by quite a few members of the Apache Bigtop
community) we can definitely do better than this if we all agree that
the proposed first 'beta' release of Hadoop 2.0.4 is the right time for
it to happen. It is about time the Hadoop 2.X community won back all
those end users
and downstream projects that got left behind during the alpha