-
[PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Matt Foley 2012-11-21, 19:15
This discussion started in HADOOP-8924 <https://issues.apache.org/jira/browse/HADOOP-8924>, where it was proposed to replace the build-time utility "saveVersion.sh" with a python script. This would require Python as a build-time dependency. Here's the background:

Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin, have faced the issue of frequent use of shell scripts throughout the system, both in build time (e.g., the utility "saveVersion.sh"), and run time (config files like "hadoop-env.sh" and the start/stop scripts in "bin/*"). Similar usages exist throughout the Hadoop stack, in all projects.

The vast majority of these shell scripts do not do anything platform specific; they can be expressed in a posix-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows).

The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.

Regarding the choice of python:

 - There are already a few instances of python usage in Hadoop, such as the utility (currently broken) "relnotes.py", and massive usage of python in the examples/ and contrib/ directories.
 - Python is also used in Bigtop build-time.
 - The Python language is available for free on essentially all platforms, under an Apache-compatible license <http://www.apache.org/legal/resolved.html>.
 - It is supported in Eclipse and similar IDEs.
 - Most importantly, it is widely accepted as a reasonably good OO scripting language, and it is easily learned by anyone who already knows shell or perl, or other common scripting languages.
 - On the Tiobe index of programming language popularity <http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html>, which seeks to measure the relative number of software engineers who know and use each language, Python far exceeds Perl and Ruby. The only more well-known scripting languages are PHP and Visual Basic, neither of which seems a prime candidate for this use.

For build-time usage, I think we should immediately approve python as a build-time dependency, and allow people who are motivated to do so, to open jiras for migrating existing build-time shell scripts to python.

For run-time, there is likely to be a lot more discussion. Lots of folks, including me, aren't real happy with use of active scripts for configuration, and various others, including I believe some of the Bigtop folks, have issues with the way the start/stop scripts work. Nevertheless, all those scripts exist today and are widely used. And they present an impediment to porting to Windows-without-cygwin. Nothing about run-time use of scripts has changed significantly over the past three years, and I don't think we should hold up the Windows port while we have a huge discussion about issues that veer dangerously into religious/aesthetic domains. It would be fun to have that discussion, but I don't want this decision to be dependent on it! So I propose that we go ahead and also approve python as a run-time dependency, and allow the inclusion of python scripts in place of current shell-based functionality. The unpleasant alternative is to spawn a bunch of powershell scripts in parallel to the current shell scripts, with a very negative impact on maintainability. The Windows port must, after all, be allowed to proceed.

Let's have a discussion, and then I'll put both issues, separately, to a vote (unless we miraculously achieve consensus without a vote :-)

I also encourage members of the other Hadoop-related projects, to carry this discussion into those forums. It would be very cool to agree on a whole-stack solution for the scripting problem.

Best regards,
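To make the "conditionally invoke a platform-specific module" idea concrete, here is a minimal sketch of how a cross-platform Python script might pick a helper. It is only an illustration under stated assumptions: the function names and the helper path "bin/set-limits" are hypothetical, and nothing like this exists in Hadoop today.

```python
import platform
import subprocess


def platform_helper_command(script_base, system=None):
    """Build the command line for a platform-specific helper script.

    script_base is a hypothetical path without extension, for example
    "bin/set-limits"; the matching ".ps1" or ".sh" file is assumed to exist.
    """
    system = system or platform.system()
    if system == "Windows":
        # Assumption: the Windows variant is a PowerShell script.
        return ["powershell", "-File", script_base + ".ps1"]
    # Everything else is treated as POSIX-like in this sketch.
    return ["sh", script_base + ".sh"]


def run_platform_helper(script_base):
    """Run the helper for the current platform and return its exit code."""
    return subprocess.call(platform_helper_command(script_base))
```

The common logic would stay in Python, and only the rare platform-specific step would go through something like run_platform_helper("bin/set-limits").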
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-23, 23:40
The discussion seems to have ended; let's start the vote.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Matt Foley 2012-11-24, 20:13
Please see new [VOTE] thread.
On Fri, Nov 23, 2012 at 3:40 PM, Radim Kolar <[EMAIL PROTECTED]> wrote:
> The discussion seems to have ended; let's start the vote.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-24, 21:26
We have not discussed the advantages of standalone Python vs. Jython in the Maven POM:

http://code.google.com/p/jy-maven-plugin/

The language is about the same, and Jython does not need to be installed, which is an advantage on Windows.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Konstantin Boudnik 2012-11-24, 22:03
If we decide to go with Maven then there's no point in complicating the picture with Jython. This time I will keep the offensive remarks about *ython to myself ;)

Cos

On Sat, Nov 24, 2012 at 10:26PM, Radim Kolar wrote:
> We have not discussed the advantages of standalone Python vs.
> Jython in the Maven POM:
>
> http://code.google.com/p/jy-maven-plugin/
>
> The language is about the same, and Jython does not need to be
> installed, which is an advantage on Windows.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Alejandro Abdelnur 2012-11-21, 19:25
Hey Matt,

We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on its way out with the move of docs to APT).

Why not do a maven-plugin to do that?

Colin already has something to simplify all the cmake calls from the builds using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887).

We could do the same with protoc, thus simplifying the POMs.

The saveVersion.sh seems like another prime candidate for a maven plugin, and in this case it would not require external tools.

Does this make sense?

Thx

Alejandro
+
Radim Kolar 2012-11-21, 20:46
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Konstantin Boudnik 2012-11-21, 21:33
On Wed, Nov 21, 2012 at 09:46PM, Radim Kolar wrote:
>
> >Why not do a maven-plugin to do that?
> maven plugins are difficult to maintain. its better to use inline
> scripts, with something like this:
>
> http://docs.codehaus.org/display/GMAVEN/Home;jsessionid=E29093B96230BBB4461F02A1718A6B71

Exactly my point, thank you!

Cos
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Konstantin Boudnik 2012-11-21, 20:00
I like Alejandro's idea about Maven for a few reasons:

 - bringing in a scripting environment which is known for its inter-version idiosyncrasies just because Windows can't handle trivial shell scripting looks like overkill to me
 - relative to above, there's a chance that Python's pre-requisites used in Hadoop might get into a conflict with some other components in the stack. This will be a nightmare for the integrator projects i.e. Bigtop
 - Maven is de-facto standard for Java stacks
 - Maven has built-in scripting language (Groovy) if some plugins aren't sufficient for achieving whatever goals

Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it uses Maven stuff such as deploy/install via custom ant tasks. The same approach would work for saveVersion.sh and others, I am sure.

Cos
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Matt Foley 2012-11-21, 21:14
Cos,
Please see in-line.

On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:

> I like Alejandro's idea about Maven for a few reasons:
> - bringing in a scripting environment which is known for its inter-version
>   idiosyncrasies just because Windows can't handle trivial shell scripting
>   looks like overkill to me

Excuse me? Can we at least try not to belittle other people's platforms on a public Apache forum? There's nothing trivial about implementing shell on Windows, as cygwin regrettably proved.

> - relative to above, there's a chance that Python's pre-requisites used in
>   Hadoop might get into a conflict with some other components in the stack.
>   This will be a nightmare for the integrator projects i.e. Bigtop

Said Bigtop project actually uses python, does it not?

> - Maven is de-facto standard for Java stacks

Sure -- except for when Ant was the de-facto standard for Java stacks. And let's remember what maven and ant are/were the de-facto standard for: Doing builds. Not scripting everything that needs scripting.

> - Maven has built-in scripting language (Groovy) if some plugins aren't
>   sufficient for achieving whatever goals

Are you proposing Groovy as a better scripting language than Python?

> Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it uses
> Maven stuff such as deploy/install via custom ant tasks. The same approach
> would work for saveVersion.sh and others, I am sure.

Current ant scripts in Hadoop seem to use maven only for artifact management via the maven repository. If I'm missing something, please point it out. The ant build task currently calls out to saveVersion.sh. Having it call out to maven, which then calls out to a plug-in and/or a Groovy script, doesn't sound like an improvement to me. And it's a way different use of maven than currently in the Hadoop-1 line, not a continuation of established practice.

--Matt
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Konstantin Boudnik 2012-11-21, 21:50
Ditto...

On Wed, Nov 21, 2012 at 01:14PM, Matt Foley wrote:
> Cos,
> Please see in-line.
>
> On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:
>
> > I like Alejandro's idea about Maven for a few reasons:
> > - bringing in a scripting environment which is known for its
> >   inter-version idiosyncrasies just because Windows can't handle trivial
> >   shell scripting looks like overkill to me
>
> Excuse me? Can we at least try not to belittle other people's platforms on
> a public Apache forum? There's nothing trivial about implementing shell on
> Windows, as cygwin regrettably proved.

Belittle? Hardly ;) Because we all know very well why shell is so awkward to implement on any Windows system.

> > - relative to above, there's a chance that Python's pre-requisites used
> >   in Hadoop might get into a conflict with some other components in the
> >   stack. This will be a nightmare for the integrator projects i.e. Bigtop
>
> Said Bigtop project actually uses python, does it not?

It does, Matt. The main concern I have is that at some point Hadoop's Python might all of a sudden be of a different version than the one in BigTop. And all hell will break loose compatibility-wise. What would be the solution then?

> > - Maven is de-facto standard for Java stacks
>
> Sure -- except for when Ant was the de-facto standard for Java stacks. And

Arguable. Yet beside the point.

> let's remember what maven and ant are/were the de-facto standard for:
> Doing builds. Not scripting everything that needs scripting.

Arguable as well, due to the very definition of a build system.

> > - Maven has built-in scripting language (Groovy) if some plugins aren't
> >   sufficient for achieving whatever goals
>
> Are you proposing Groovy as a better scripting language than Python?

I am proposing that Groovy is a better language than Python. Because, in part, it goes far beyond scripting. And it doesn't have permanent runtime backward compatibility issues. When was the last time the JDK had backward compatibility problems?

> > Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it
> > uses Maven stuff such as deploy/install via custom ant tasks. The same
> > approach would work for saveVersion.sh and others, I am sure.
>
> Current ant scripts in Hadoop seem to use maven only for artifact
> management via the maven repository. If I'm missing something, please
> point it out. The ant build task currently calls out to saveVersion.sh.
> Having it call out to maven, which then calls out to a plug-in and/or a
> Groovy script, doesn't sound like an improvement to me. And it's a way

At least it is guaranteed to work everywhere. And all we need in this case is an extra jar file that can be pulled down through the same ivy/maven dependency mechanism. In the case of Python you'd have to make sure that you have the right version of the interpreter and runtime. And you will have to do it manually or have an extra requirement expressed via a system maintenance DSL.

> different use of maven than currently in the Hadoop-1 line, not a
> continuation of established practice.

The main point of my argument, expressed in fewer than 100 words: adding Python, which is inconsistent across different Linux distros and has a history of backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.), doesn't seem to be worth the benefit of having a somewhat easier build on Windows. Perhaps we can do a more formal benefit analysis by just comparing the number of Hadoop installations on MS Windows vs. Unixes.

Cos
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Andy Isaacson 2012-11-21, 23:00
On Wed, Nov 21, 2012 at 1:50 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:
> The main point of my argument, expressed in fewer than 100 words: adding
> Python, which is inconsistent across different Linux distros and has a
> history of backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.),
> doesn't seem to be worth the benefit of having a somewhat easier build on
> Windows.

This seems mostly like a red herring to me. I'll grant that it's hard to build a complete Python stack that's compatible between Python 2.x and 2.y, but it's very easy by following best practices to keep python *scripts* compatible across all reasonable Python 2.x versions. Simply pick an oldest-supported-version like 2.4 and be reasonably disciplined to not use newer constructs or libraries.

I wouldn't wish to try to build a complete system in such a limited dialect [1], but for "we need a reasonable replacement for /bin/sh scripts" it's just fine.

Ignore Python 3 for the time being, it's a completely different language with incompatible syntax and semantics that doesn't support several currently-important platforms. Maybe in a few years sane people can consider moving to it, but for now it's best to just stick with the compatible subset of Python 2.x.

[1] the Mercurial project has had a pretty good experience with this scheme; http://mercurial.selenic.com/wiki/SupportedPythonVersions they currently support 2.4 - 2.7 with a few required libraries. They dropped 2.2 and 2.3 support a few years ago due to specific shortcomings on those versions.

-andy
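The "pick an oldest-supported version" discipline can be enforced with a small guard at the top of each script. What follows is a minimal sketch, assuming 2.4 as the floor (the example version mentioned above) and following the advice to stay on 2.x; the constant and function names are hypothetical.

```python
import sys

# Assumptions for this sketch: 2.4 is the oldest supported interpreter,
# and anything from 3.0 onward is treated as unsupported.
OLDEST_SUPPORTED = (2, 4)
FIRST_UNSUPPORTED = (3, 0)


def is_supported(version_info=None):
    """Return True if the interpreter version falls in the supported range."""
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    return OLDEST_SUPPORTED <= current < FIRST_UNSUPPORTED
```

A script would call is_supported() before doing any work and exit with a readable message when it returns False. The guard only catches a wrong interpreter; it does nothing to stop someone from using a newer construct in the script body, so the discipline itself still has to come from review and testing on the oldest version.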
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-21, 23:58
> Ignore Python 3 for the time being, it's a completely different language
> with incompatible syntax and semantics that doesn't support several
> currently-important platforms. Maybe in a few years sane people can
> consider moving to it, but for now it's best to just stick with the
> compatible subset of Python 2.x.
>
> [1] the Mercurial project has had a pretty good experience with this
> scheme; http://mercurial.selenic.com/wiki/SupportedPythonVersions
> they currently support 2.4 - 2.7 with a few required libraries. They
> dropped 2.2 and 2.3 support a few years ago due to specific shortcomings
> on those versions.

I know that Python compatibility can be worked around. I used Python for a few years and wrote about 70k LOC in it, until it started to irritate me that every new version has incompatibilities (2.4 vs 2.3 vs 2.5), which makes maintaining and testing way harder than it should be. It's not just compatibility with missing library functions; sometimes even an expression evaluated to a different value under a new version. This was similar to the PHP 4 to PHP 5 migration. Today I have 3 versions of Python installed because of software requirements.

For simple scripts it can probably work if you stick to some common subset.

Scripting via a Maven plugin has the advantage that the user does not need to install anything. There are a couple of languages available: scala, groovy, jelly, jruby. Maybe jython too.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Steve Loughran 2012-11-22, 09:21
On 21 November 2012 23:58, Radim Kolar <[EMAIL PROTECTED]> wrote:
>
> Scripting via a Maven plugin has the advantage that the user does not need
> to install anything. There are a couple of languages available: scala,
> groovy, jelly, jruby. Maybe jython too.

The JSR-223 bridge comes with a JavaScript interpreter built in, BTW. You can actually use it in Ant's <script> and <scriptdef> tasks without even having to stick a new JAR on the classpath.

That doesn't mean it's ideal. There was a recent discussion on bigtop dev about moving to a later version of Groovy; Roman found they ran into a problem where the Maven Groovy code was reluctant to upgrade:

http://groovy.329449.n5.nabble.com/groovy-maven-td4382545.html#a4382976
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Konstantin Boudnik 2012-11-22, 01:46
On Thu, Nov 22, 2012 at 12:58AM, Radim Kolar wrote:
> I know that Python compatibility can be worked around. I used Python
> for a few years and wrote about 70k LOC in it, until it started to
> irritate me that every new version has incompatibilities (2.4 vs 2.3
> vs 2.5), which makes maintaining and testing way harder than it
> should be. It's not just compatibility with missing library
> functions; sometimes even an expression evaluated to a different
> value under a new version. This was similar to the PHP 4 to PHP 5
> migration. Today I have 3 versions of Python installed because of
> software requirements.
>
> For simple scripts it can probably work if you stick to some common subset.
>
> Scripting via a Maven plugin has the advantage that the user does not
> need to install anything. There are a couple of languages available:
> scala, groovy, jelly, jruby. Maybe jython too.

Pretty much all of the j* in JSR223 land is an abomination of one sort or another, actually :)

Cos
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-22, 01:57
> Pretty much all of the j* in JSR223 land is an abomination of one sort or
> another, actually :)

JRuby is good because you can run a Rails application on standard Java infrastructure, which is way easier to maintain than obscure Ruby application servers.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Chris Nauroth 2012-11-21, 21:03
I worked on some of the Python build scripting that currently resides in branch-trunk-win. Initially, my goal was to keep a "pure" Maven implementation to the greatest degree possible without external scripting, but I encountered a few problems:

1. One approach is to try to express all of the build logic with existing Maven plugins. This turned out to be infeasible in some cases. I don't know of an existing plugin that does anything like the logic in saveVersion.sh/.py for walking the source tree and checksumming the files. For protoc, I saw a proposed plugin in open source, but it hadn't reached release status yet. For creation of the distribution tarballs, the Maven Ant Plugin (and actually the underlying Ant tool) cannot preserve file permissions or symlinks.

2. Considering that the first approach isn't possible, another possibility is to write custom Maven plugins. This would require significantly more engineering time to write and test the code. I think there are some legitimate concerns too about supportability, because this approach would put significant build logic into Maven plugin code instead of something more easily visible to release engineers, like pom.xml and external scripts. Also, I'm actually not sure that we can implement everything with a Maven plugin. For example, I mentioned the problem of preserving file permissions and symlinks in the distribution tarballs. Ant hasn't been able to fix that problem due to a Java limitation, so our Maven plugins coded in Java (or another JVM language) likely would suffer the same fate. We might be stuck with some amount of external scripting no matter what.

Thank you,
--Chris
+
Chris Nauroth 2012-11-21, 21:03
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-21, 21:30
On 21.11.2012 22:03, Chris Nauroth wrote:
> For creation of the distribution tarballs, the Maven Ant Plugin (and actually the underlying Ant tool) cannot preserve file permissions or symlinks.

The Maven assembly plugin can deal with file permissions. I'm not sure about symlinks; I do not remember the dist tar having symlinks inside.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Chris Nauroth 2012-11-21, 21:44
Sorry, to clarify my point a little more, Ant does allow you to make declarations to explicitly set the desired file permissions via the fileMode attribute of a tarfileset. However, it does not have the capability to preserve whatever permissions were naturally created on files earlier in the build process. This is a difference in maintainability, as adding new files to the build may then require extra maintenance of the Ant directives to apply the desired fileMode. This is an easy thing to overlook. A solution that preserves the natural permissions requires less maintenance overhead.
I couldn't find a way to make the assembly plugin preserve permissions like this either. It just has explicit fileMode directives similar to Ant. (Let me know if I missed something though.)
To see symlinks show up in distribution tarballs, you need to build with the native components, like libhadoop.so or bundled Snappy.
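For comparison with the explicit fileMode approach, here is a minimal sketch of what "preserving the natural permissions" could look like in a Python build script. It assumes only the standard library `tarfile` module; the function name and the paths passed to it are hypothetical.

```python
import os
import tarfile


def make_dist_tarball(source_dir, tar_path):
    """Archive source_dir, keeping each file's existing mode and any symlinks.

    Hypothetical helper for illustration only.
    """
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w:gz") as archive:
        # add() records the mode found on disk for every entry and, with the
        # default dereference=False, stores symlinks as symlink entries
        # rather than copying the files they point to.
        archive.add(source_dir, arcname=top_level)
```

Because the modes come from the filesystem, a new executable added earlier in the build needs no matching directive here. The trade-off is that the result depends on the build machine's filesystem, so this only gives the intended modes where the filesystem actually records them.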
Thanks, --Chris
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-21, 23:15
On 21.11.2012 22:44, Chris Nauroth wrote:
> Sorry, to clarify my point a little more, Ant does allow you to make declarations to explicitly set the desired file permissions via the fileMode attribute of a tarfileset.

There are just 2 directories, /bin and /sbin, with executable files. It's probably possible to set the file mode per directory in the Maven assembly plugin.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Chris Nauroth 2012-11-22, 00:14
Unfortunately, there are a couple of spots where it gets really messy and directory-wide rules fail to cover it. The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat. Initially, I tried to do this using only the ant plugin, but I wound up with a ton of different tarfileset directives with different fileMode values to reapply the same permissions that were present in the original Tomcat distribution. This also would have been a brittle solution, because changes in the Tomcat package would risk invalidating our ant rules. A solution that preserves the original permissions reduces this kind of maintenance work.
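To make the "unpack and repack" problem concrete, here is a hedged sketch of one way a script could repack an upstream tarball while carrying over the original permissions. It copies archive entries directly instead of extracting them to disk. It uses only the Python standard library; the function name and the `extra_files` parameter are assumptions for the example, not part of the Hadoop build.

```python
import tarfile


def repack_tarball(src_path, dest_path, extra_files=None):
    """Copy every member of src_path into dest_path with metadata unchanged.

    extra_files (hypothetical) maps archive names to local paths to append.
    """
    with tarfile.open(src_path, "r:*") as src, \
            tarfile.open(dest_path, "w:gz") as dest:
        for member in src.getmembers():
            # Mode, entry type and link target all travel inside the TarInfo
            # object, so nothing has to be re-declared per file. Only regular
            # files carry data; directories and symlinks are added as-is.
            data = src.extractfile(member) if member.isfile() else None
            dest.addfile(member, data)
        for arcname, local_path in (extra_files or {}).items():
            dest.add(local_path, arcname=arcname)
```

Since the entries never touch the local filesystem, a change in the upstream package's permissions would flow through without any rule updates.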
Thanks, --Chris
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-22, 01:55
On 22.11.2012 1:14, Chris Nauroth wrote:
> The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat.

Why is it not possible to just ship a WAR file? It seems to be a special-purpose app, and such apps need manual security setup anyway, plus integration with existing firewall/web infrastructure.

Did you consider using Jetty? It has really good Maven support: http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin

I am using Jetty 8 instead of Tomcat and run it with java -jar start.jar; no extra file permissions like the x bit are needed.

If you really need to create the tar by hand, there is a Java library for doing it, http://code.google.com/p/jtar/, and it can be used from any JVM-based scripting language, so you have plenty of choices.
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Chris Nauroth 2012-11-22, 02:40
This predates me, so I don't know the rationale for repackaging Tomcat inside HTTPFS. I suspect that there was a desire to create a fully stand-alone distribution package, including a full web server. The Maven Jetty plugin isn't directly applicable to this use case. I don't know why it was decided to use Tomcat instead of Jetty. (If anyone else out there has the background, please respond.)

Regardless, if the desire is to package a full web server instead of just the war, then switching to Jetty would not change the challenges of the build process. We'd still need to preserve whatever permissions are present in the Jetty distribution.

In general, when I was working on this, I did not question whether the current packaging was "correct". I assumed that whatever changes I made for Windows compatibility must yield the exact same distribution without changes on currently supported platforms like Linux. If there are questions around actually changing the output of the build process, then that will steer the conversation in another direction and increase the scope of this effort.

It seems like the trickiest issue is preservation of permissions and symlinks in tar files. I suspect that any JVM-based solution like custom Maven plugins, Groovy, or jtar would be limited in this respect. According to Ant documentation, it's a JDK limitation, so I suspect all of these would have the same problem. I haven't tried any of them though. (If there was a feasible solution, then Ant likely would have incorporated it long ago.) If anyone wants to try though, we might learn something from that.

Thank you, --Chris
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Radim Kolar 2012-11-22, 14:54
> We'd still need to preserve whatever permissions are present in the Jetty distribution.

In the Jetty distribution there is just one shell startup script, and you can even run Jetty without it using the autostartable jar. A requirement to preserve permissions is overkill; at most you need to chmod +x one script. In Tomcat it would be similar.

> Maven plugins, Groovy, or jtar would be limited in this respect.

In jtar you are manipulating the resulting tar file directly: http://code.google.com/p/jtar/source/browse/#svn%2Ftrunk%2Fjtar%2Fsrc%2Fmain%2Fjava%2Forg%2Fxeustechnologies%2Fjtar
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Steve Loughran 2012-11-22, 09:02
On 22 November 2012 02:40, Chris Nauroth <[EMAIL PROTECTED]> wrote:
> It seems like the trickiest issue is preservation of permissions and symlinks in tar files. I suspect that any JVM-based solution like custom Maven plugins, Groovy, or jtar would be limited in this respect. According to Ant documentation, it's a JDK limitation, so I suspect all of these would have the same problem. I haven't tried any of them though. (If there was a feasible solution, then Ant likely would have incorporated it long ago.) If anyone wants to try though, we might learn something from that.
>
> Thank you,
> --Chris

You are limited by what File.canRead(), canWrite() and canExecute() tell you.
The absence of a way to detect file permissions in Java is because of the lowest-common-denominator approach of the Java filesystem APIs, supporting FAT32 (odd case logic, no perms or symlinks), NTFS (odd case logic, ACLs over perms, symlinks historically very hard to create), HFS+ (case insensitive unix fs!) as well as classic unixy filesystems.
Ant <tarfileset> filesets in <tar> let you spec permissions on filesets you pull into the tar; they are generated x-platform, which is the other reason why you declare them in <tar>: you have the right to generate proper tar files even if you use a Windows box.
Symlinks are problematic; even detecting them cross-platform is pretty unreliable. To really do them you'd need to add a new <symlinkfileset> entity for <tar> that would take the link declaration. I could imagine how to do that, and if stuck into the hadoop tools JAR, it wouldn't even depend on a new version of Ant.
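The "declare it rather than detect it" idea is not specific to Ant. As a rough illustration in Python (the language proposed in this thread), the sketch below writes a file entry with a declared mode and a symlink entry with a declared target, without reading either from the build machine's filesystem. The helper names are hypothetical, and this is not how the proposed <symlinkfileset> would be implemented.

```python
import io
import tarfile


def add_declared_file(archive, name, data, mode):
    """Add a regular file whose mode is declared, not read from disk."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    info.mode = mode
    archive.addfile(info, io.BytesIO(data))


def add_declared_symlink(archive, name, target):
    """Add a symlink entry without creating a link on the build machine."""
    info = tarfile.TarInfo(name)
    info.type = tarfile.SYMTYPE
    info.linkname = target
    archive.addfile(info)
```

Because nothing is looked up on disk, the same archive can be produced on a filesystem that has no permission bits or symlinks at all.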
Maven just adds extra layers in the way.
-Steve
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Matt Foley 2012-11-21, 19:44
Hi Alejandro,
For build-time issues in branch-2 and beyond, this may make sense (although I'm concerned about obscuring functionality in a way that only maven experts will be able to understand). In the particular case of saveVersion.sh, I'd be happy to see it done automatically by the build tools.

However, for build-time issues in the non-mavenized branch-1, and for run-time issues in both worlds, the need for cross-platform scripting remains.

Thanks, --Matt
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Alejandro Abdelnur 2012-11-21, 19:58
Got it, thx.

BTW, for branch-1, how about doing an ant task as part of the build that does that?

Thx

-- Alejandro
-
Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack
Steve Loughran 2012-11-22, 09:14
On 21 November 2012 19:15, Matt Foley <[EMAIL PROTECTED]> wrote:
> This discussion started in
>
> Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin, have faced the issue of frequent use of shell scripts throughout the system, both in build time (eg, the utility "saveVersion.sh"), and run time (config files like "hadoop-env.sh" and the start/stop scripts in "bin/*" ). Similar usages exist throughout the Hadoop stack, in all projects.
>
> The vast majority of these shell scripts do not do anything platform specific; they can be expressed in a posix-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows).
>
> The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.

I'd argue that a lot of Hadoop java developers aren't that familiar with bash. It's only in the last six months that I've come to hate it properly.
In the ant project, it was the launcher scripts that had the worst bugrep:line ratio, because of:
- variations in .sh behaviour, especially under cygwin, but also things that weren't bash (AIX, ...)
- requirements of the entire unix command set for real work
- variants in the parameters/behaviour of those commands between Linux and other widely used Unix systems (e.g. OSX)
- lack of inclusion of the .sh scripts in the junit test suite
- lack of understanding of bash.
In the ant project we added a Python launcher in, what, 2001, based on the Perl launcher supplied by one [EMAIL PROTECTED]ceforge

> For run-time, there is likely to be a lot more discussion. Lots of folks, including me, aren't real happy with use of active scripts for configuration, and various others, including I believe some of the Bigtop folks, have issues with the way the start/stop scripts work. Nevertheless, all those scripts exist today and are widely used. And they present an impediment to porting to Windows-without-cygwin.
They're a maintenance and support cost on Unix. Too many scripts, even more in Yarn, weakly-nondeterministic logic for loading env variables, especially between init.d and bin/hadoop; not much diagnostics. And as with Ant, a relatively under-comprehended language with no unit test coverage.
I'd replace the bash logic with python for Unix dev and maintenance alone. You could put your logic into a shared python module in usr/lib/hadoop/bin, have PyUnit test the inner functions as part of the build and test process (& jenkins).

> Nothing about run-time use of scripts has changed significantly over the past three years, and I don't think we should hold up the Windows port while we have a huge discussion about issues that veer dangerously into religious/aesthetic domains. It would be fun to have that discussion, but I don't want this decision to be dependent on it!

With Yarn it's got more complex. More env variables to set, more support calls when they aren't.

> So I propose that we go ahead and also approve python as a run-time dependency, and allow the inclusion of python scripts in place of current shell-based functionality. The unpleasant alternative is to spawn a bunch

+1 to any vote to allow .py at run time as a new feature
=0 to ripping out and replacing the existing .sh scripts with python code, as even though I don't like the scripts, replacing them could be traumatic downstream.
+1 to a gradual migration to .py for new code, starting with the yarn scripts.
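The suggestion above of a shared Python module whose inner functions are covered by PyUnit could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the choice of `JAVA_HOME` and `HADOOP_HEAPSIZE` as the variables handled, and the way the command line is assembled. It is not a port of the existing bin/hadoop logic.

```python
import os
import unittest


def resolve_hadoop_env(environ, defaults):
    """Return defaults overridden by any non-empty value found in environ."""
    resolved = dict(defaults)
    for key in defaults:
        value = environ.get(key)
        if value:
            resolved[key] = value
    return resolved


def build_java_command(env, main_class, args):
    """Build the argv list a launcher could pass to subprocess."""
    java = os.path.join(env["JAVA_HOME"], "bin", "java")
    heap = "-Xmx" + env["HADOOP_HEAPSIZE"] + "m"
    return [java, heap, main_class] + list(args)


class LauncherTest(unittest.TestCase):
    """PyUnit coverage for the pure functions; no process is started."""

    def test_environment_overrides_defaults(self):
        env = resolve_hadoop_env(
            {"HADOOP_HEAPSIZE": "2000"},
            {"JAVA_HOME": "/opt/java", "HADOOP_HEAPSIZE": "1000"})
        self.assertEqual(env["HADOOP_HEAPSIZE"], "2000")
        self.assertEqual(env["JAVA_HOME"], "/opt/java")
```

Keeping the environment lookup and command construction as pure functions that take a dictionary, rather than reading `os.environ` directly, is what makes them testable from a build job; a test runner such as `python -m unittest` can then exercise them on any platform.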