Hadoop >> mail # dev >> [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

[PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
This discussion started in
, where it was proposed to replace the build-time utility "saveVersion.sh"
with a python script.  This would require Python as a build-time
dependency.  Here's the background:

Those of us involved in the branch-1-win port of Hadoop to Windows without
use of Cygwin have faced the issue of frequent use of shell scripts
throughout the system, both at build time (e.g., the utility "saveVersion.sh")
and at run time (config files like "hadoop-env.sh" and the start/stop scripts
in "bin/*" ).  Similar usages exist throughout the Hadoop stack, in all

The vast majority of these shell scripts do not do anything
platform-specific; they can be expressed in a POSIX-conforming way.
Therefore, it seems to us that it makes sense to start using a
cross-platform scripting language, such as Python, in place of shell for
these purposes.  For those rare occasions where platform-specific
functionality really is needed, Python also supports quite a lot of it on
both Linux and Windows; but where that is inadequate, one could still
conditionally invoke a platform-specific module written in shell (for
Linux/*nix) or PowerShell or bat (for Windows).
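
To illustrate the conditional invocation described above, here is a minimal
sketch, not part of the original proposal.  The helper name
"platform_helper_command" and the script base name passed to it are
hypothetical; the sketch assumes each helper exists as a ".sh" file for
Linux/*nix and a ".ps1" file for Windows, and that "sh" or "powershell" is on
the PATH.

```python
import platform
import subprocess


def platform_helper_command(script_base, system=None):
    """Build the argv for a platform-specific helper script.

    script_base is a hypothetical path without extension; the ".sh" and
    ".ps1" naming convention is an assumption made for this sketch.
    """
    if system is None:
        system = platform.system()  # e.g. "Windows", "Linux", "Darwin"
    if system == "Windows":
        return ["powershell", "-File", script_base + ".ps1"]
    # Everything else is treated as Linux/*nix and goes through sh.
    return ["sh", script_base + ".sh"]


def run_platform_helper(script_base):
    """Run the helper for the current platform and return its exit code."""
    return subprocess.call(platform_helper_command(script_base))
```

Only the last step of such a script would be platform-specific; all the
logic before it stays in one shared Python module.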

The primary motive for moving to a cross-platform scripting language is
maintainability.  The alternative would be to maintain two complete suites
of scripts, one for Linux and one for Windows (and perhaps others in the
future).  We want to avoid the need to update dual modules in two different
languages when functionality changes, especially given that many Linux
developers are not familiar with PowerShell or bat, and many Windows
developers are not familiar with shell or bash.

Regarding the choice of Python:

   - There are already a few instances of Python usage in Hadoop, such as
   the utility (currently broken) "relnotes.py", and massive usage of Python
   in the examples/ and contrib/ directories.
   - Python is also used in Bigtop build-time.
   - The Python language is available for free on essentially all
   platforms, under an Apache-compatible

   - It is supported in Eclipse and similar IDEs.
   - Most importantly, it is widely accepted as a reasonably good OO
   scripting language, and it is easily learned by anyone who already knows
   shell, Perl, or other common scripting languages.
   - On the Tiobe index of programming language
   which seeks to measure the relative number of software engineers who know
   and use each language, Python far exceeds Perl and Ruby.  The only more
   well-known scripting languages are PHP and Visual Basic, neither of which
   seems a prime candidate for this use.

For build-time usage, I think we should immediately approve Python as a
build-time dependency, and allow people who are motivated to do so to open
JIRAs for migrating existing build-time shell scripts to Python.

For run-time, there is likely to be a lot more discussion.  Lots of folks,
including me, aren't real happy with use of active scripts for
configuration, and various others, including I believe some of the Bigtop
folks, have issues with the way the start/stop scripts work.  Nevertheless,
all those scripts exist today and are widely used.  And they present an
impediment to porting to Windows-without-Cygwin.

Nothing about run-time use of scripts has changed significantly over the
past three years, and I don't think we should hold up the Windows port
while we have a huge discussion about issues that veer dangerously into
religious/aesthetic domains. It would be fun to have that discussion, but I
don't want this decision to be dependent on it!

So I propose that we go ahead and also approve Python as a run-time
dependency, and allow the inclusion of Python scripts in place of current
shell-based functionality.  The unpleasant alternative is to spawn a bunch
of PowerShell scripts in parallel to the current shell scripts, with a very
negative impact on maintainability.  The Windows port must, after all, be
allowed to proceed.

Let's have a discussion, and then I'll put both issues, separately, to a
vote (unless we miraculously achieve consensus without a vote :-)

I also encourage members of the other Hadoop-related projects to carry
this discussion into those forums.  It would be very cool to agree on a
whole-stack solution for the scripting problem.

Best regards,