Re: Sqoop is moving to github!
Aaron Kimball 2010-03-30, 17:52
These are some important questions about the status of the project; thanks
for asking them. I'll go through them inline to preserve context on each.
On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
[EMAIL PROTECTED]> wrote:
> Hi Aaron,
> Good to see you are a contributor to sqoop. Are you a committer yet?
> Do you have an ICLA on file with the ASF? I cannot find any record of it.
> I must be missing something here, since PMCs are normally requesting
> ICLAs from people making such substantial code contributions.
It might be more precise to call me *the* contributor to Sqoop. I've written
about 98% of the code for it; a few other individuals have provided me with
small enhancements or bugfixes, but the overwhelming amount of its care and
feeding has been under my watch.
I am not a committer on the Hadoop MapReduce (or any other ASF) project.
Thus far, nobody has invited me to sign an ICLA with my contributor-only
status. I have relied on others (primarily Tom White) to actually commit all
the Sqoop patches to svn.
> On Mon, Mar 29, 2010 at 21:02, Aaron Kimball <[EMAIL PROTECTED]> wrote:
> > Hi Hadoop, Hive, and Sqoop users,
> > For the past year, the Apache Hadoop MapReduce project has played host to
> > Sqoop, a command-line tool that performs parallel imports and exports
> > between relational databases and HDFS. We've developed a lot of features and
> > gotten a lot of great feedback from users.
> Who is "we" exactly? Cloudera? Hadoop? You?
Both myself and Cloudera. As said above, the vast majority of the direct
work on the project has been my own. But there are others at Cloudera who
have helped in less visible fashion with feature prioritization, design
input, code review, QA, user support, etc. And the contributions I make to
Sqoop, I do so as an employee of Cloudera.
> > While Sqoop was a contrib project
> > in Hadoop, it has been steadily improved and grown.
> > But the contrib directory is a home for new or small projects incubating
> > underneath Hadoop's umbrella. Sqoop is starting to look less like a small
> > project these days. In particular, a feature that has been growing in
> > importance for Sqoop is its ability to integrate with Hive. In order to
> > facilitate this integration from a compilation and testing standpoint, we've
> > pulled Sqoop out of contrib and into its own repository hosted on github.
> So, you are forking sqoop. To facilitate that an Hadoop project can
> work with another Hadoop project.
> What are the issues with Hadoop that you cannot do it within Hadoop itself?
When you put it like that, "forking" seems like a bit of a strong term. As
said in my original email, I prefer to think of it as "moving." (See the
next answer below for more on this.)
I believe Owen has already described some of the technical problems.
Conflating Sqoop's source repository with Hadoop's causes unnecessary
circular dependencies that build tools cannot easily work around. The more
straightforward method is to factor out Sqoop into a separate source repository.
> > You can download all the relevant bits here:
> > http://www.github.com/cloudera/sqoop
> > The code there will run in conjunction with the Apache Hadoop trunk.
> > (Compatibility with other distributions/versions is forthcoming.)
> Sqoop is in ASF svn. What do you do when someone is going to continue
> developing it here?
> Then there's a naming clash. Do you intend to rename your fork?
I have filed MAPREDUCE-1644 with a patch that completely removes Sqoop from
the MapReduce repository. This will remove Sqoop from the working copy of
the repository, but of course, it will still belong to the ASF's repository
history. (Thus, I hope this will be seen as a straightforward lateral move
more than a fork.)
It's worth pointing out that Sqoop was originally introduced in HADOOP-5815,
committed after 0.20 was branched for release and closed to new features. So
Sqoop has only existed on unreleased development branches in ASF svn. As
such, removing new features from the working copy is still allowed. As
Hadoop is gearing up for a new release, now is the time to consider whether
side-projects like this belong in the same umbrella project.
Others can -1 the removal patch and force a copy of Sqoop to remain in ASF.
This will force Sqoop to be bundled with the impending Hadoop 0.21 release.
However, I do not intend to rename Sqoop. I also intend to do all feature
and bugfix development in the new repository on github. I will be monitoring
the issue tracker on github for bug reports and feature requests. For
someone else to seriously -1 MAPREDUCE-1644, they'd need to be willing to
fix bugs in the ASF copy themselves, or cross-port the patches I develop at
github and graft them on to the ASF copy.
If others are interested in contributing to Sqoop and would like to take on
a role in the project, I welcome them to come help me out at github, rather
than force a true fork to occur and work within MapReduce svn. If enough
people want to work on Sqoop but feel strongly that we should remain in the
ASF (e.g., by introducing a new project in the incubator), I'm certainly
open to listening to that point of view. But that's a separate discussion
from this one.
- Aaron Kimball