Olga Natkovich
2011-11-07, 19:15
Alejandro Abdelnur
2011-11-07, 19:40
Daniel Dai
2011-11-07, 21:42
Alan Gates
2011-11-08, 16:04
Russell Jurney
2011-11-08, 16:17
Dmitriy Ryaboy
2011-11-08, 18:05
-
[DISCUSSION] Pig releases with different versions of Hadoop
Olga Natkovich 2011-11-07, 19:15
Hi,
In the past we have, for the most part, avoided supporting multiple versions of Hadoop with the same version of Pig. This is about to change with the release of Hadoop 23. We need to come up with a strategy for supporting that. There are a couple of issues to consider:

(1) Version numbering. Encoding the information in the last version number seems to make sense. The details of the encoding need to be hashed out.

(2) Code changes required to support different versions of Hadoop. This time around we made an effort to make sure that the same code can work with both. In the future that might not work, and we would need to figure out how to maintain different code bases. Most likely we would have to have additional branches off of the main release branch.

(3) Anything else we need to consider?

Olga
-
Re: [DISCUSSION] Pig releases with different versions of Hadoop
Alejandro Abdelnur 2011-11-07, 19:40
Hi Olga,
Regarding #1, does this mean we'd have a build of Pig X for each version of Hadoop we support? It seems to me this would be a bit complex to maintain.

Regarding #2, if Hadoop does a good job of maintaining public API backwards compatibility and Pig uses only the Hadoop public API, we would be good.

Regarding #3, I can still see potential issues (from my experience with Hadoop-Oozie) where the API did not change but the behavior did. This means we'll have to be able to if/then/else within Pig whenever necessary, based on the version of Hadoop.

A possible way of addressing this would be:

* Pig should use the 'hadoop' command to run Pig (this would help to cleanly bring the Hadoop dependencies into the classpath).
* Pig could have a whitelist of the Hadoop versions it supports and fail if the current Hadoop version is not supported (we could use version regexes/ranges).
* (What I'm suggesting in #3 above) Pig could use the Hadoop version as a code selector whenever necessary.

Thanks.

Alejandro
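The whitelist and code-selector ideas above could look roughly like the following Java sketch. It is an illustration only, not code from Pig: the class name `HadoopVersionGate`, its methods, and the two patterns are assumptions made up for this example, and the version string is passed in as a parameter (in a running process it would presumably come from something like Hadoop's `VersionInfo.getVersion()`).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: a whitelist of supported Hadoop versions plus a code selector. */
final class HadoopVersionGate {
    // Illustrative pattern for the 0.23 line; not an official definition.
    private static final Pattern HADOOP_23 = Pattern.compile("0\\.23([.-].*)?");

    // Illustrative whitelist only; the real supported set is a project decision.
    private static final List<Pattern> SUPPORTED = Arrays.asList(
            Pattern.compile("0\\.20\\.\\d+([.-].*)?"),
            HADOOP_23);

    private HadoopVersionGate() {}

    /** Returns true if the version string matches any whitelisted pattern. */
    static boolean isSupported(String hadoopVersion) {
        for (Pattern p : SUPPORTED) {
            if (p.matcher(hadoopVersion).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Fails fast when the detected Hadoop version is not on the whitelist. */
    static void requireSupported(String hadoopVersion) {
        if (!isSupported(hadoopVersion)) {
            throw new IllegalStateException("Unsupported Hadoop version: " + hadoopVersion);
        }
    }

    /** Code selector: true when version-specific logic for the 0.23 line should run. */
    static boolean isHadoop23Line(String hadoopVersion) {
        return HADOOP_23.matcher(hadoopVersion).matches();
    }

    public static void main(String[] args) {
        String version = args.length > 0 ? args[0] : "0.20.2";
        requireSupported(version);
        System.out.println(isHadoop23Line(version) ? "0.23 code path" : "0.20 code path");
    }
}
```

A selector like `isHadoop23Line` only covers behavior differences that can be resolved at run time; it does not help when the two Hadoop lines need different compile-time dependencies.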
-
Re: [DISCUSSION] Pig releases with different versions of Hadoop
Daniel Dai 2011-11-07, 21:42
Hi, Alejandro,
I understand your concern, but creating multiple pig.jar files is inevitable. See my comments below.

Daniel

On Mon, Nov 7, 2011 at 11:40 AM, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:

> Regarding #1, does this mean we'd have a build of Pig X for each version of Hadoop we support? It seems to me this would be a bit complex to maintain.

Yes. Currently we only plan to support 20.x and 23 (there is some work for Hadoop 22 in PIG-2277 <https://issues.apache.org/jira/browse/PIG-2277>, but I don't know how it will end up). This is complex, but I cannot see how we can avoid it. Hopefully Hadoop will converge and become API stable, so that we don't need to do this trick in future Hadoop releases.

> Regarding #2, if Hadoop does a good job of maintaining public API backwards compatibility and Pig uses only the Hadoop public API, we would be good.

That's not true, at least for the new APIs in 23.

> Regarding #3, I can still see potential issues (from my experience with Hadoop-Oozie) where the API did not change but the behavior did. This means we'll have to be able to if/then/else within Pig whenever necessary, based on the version of Hadoop.

We already use such tricks wherever we can resolve the version divergence with if/then/else or reflection. In that case we need to maintain only one pig.jar. However, there are some static dependencies that cannot be solved by these tricks; that's why we do need a shims layer and have to generate a different pig.jar for each version of Hadoop.

> A possible way of addressing this would be:
>
> * Pig should use the 'hadoop' command to run Pig (this would help to cleanly bring the Hadoop dependencies into the classpath).

We've already done this in PIG-2239.

> * Pig could have a whitelist of the Hadoop versions it supports and fail if the current Hadoop version is not supported (we could use version regexes/ranges).
> * (What I'm suggesting in #3 above) Pig could use the Hadoop version as a code selector whenever necessary.
>
> Thanks.
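As an illustration of the reflection trick and the shims layer mentioned in this reply, here is a minimal Java sketch. Everything in it is hypothetical: `HadoopShim`, `ShimLoader`, and the two shim classes are invented names, not Pig's actual shims. Both implementations live in one file here only so the example runs; in a real build each shim would be compiled against its own Hadoop version, which is exactly why separate jars end up being needed.

```java
/** Hypothetical shim interface: the rest of the code would depend only on this. */
interface HadoopShim {
    String name();
}

/** Hypothetical loader that picks a shim implementation by reflection. */
final class ShimLoader {
    /** Stand-in for code compiled against the Hadoop 0.20 APIs. */
    static final class Hadoop20Shim implements HadoopShim {
        public String name() { return "hadoop20"; }
    }

    /** Stand-in for code compiled against the Hadoop 0.23 APIs. */
    static final class Hadoop23Shim implements HadoopShim {
        public String name() { return "hadoop23"; }
    }

    private ShimLoader() {}

    /** Maps a Hadoop version string to the binary name of the shim class. */
    static String shimClassFor(String hadoopVersion) {
        String simple = hadoopVersion.startsWith("0.23") ? "Hadoop23Shim" : "Hadoop20Shim";
        return ShimLoader.class.getName() + "$" + simple;
    }

    /** Loads the shim by name so callers never reference a concrete shim class. */
    static HadoopShim load(String hadoopVersion) {
        try {
            return Class.forName(shimClassFor(hadoopVersion))
                    .asSubclass(HadoopShim.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No shim for Hadoop " + hadoopVersion, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("0.23.0").name());
    }
}
```

Loading by class name keeps the calling code free of static references to either shim, so a jar that ships only one of the two implementations would still link.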
-
Re: [DISCUSSION] Pig releases with different versions of Hadoop
Alan Gates 2011-11-08, 16:04
On Nov 7, 2011, at 11:15 AM, Olga Natkovich wrote:

> (1) Version numbering. Seems like encoding the information in the last version number makes sense. The details of the encoding need to be hashed out

I can see two options. One is to do major.minor.patch.hadoopversion, for example 0.10.1.h23 and 0.10.1.h20. The problem I see with that is we *have* to guarantee that they have the same functionality. That is, 0.10.1 has all the same patches regardless of which Hadoop version it is for (excepting maybe patches specific to a particular Hadoop version); the only difference is which one it's compiled for. Another problem is that this will proliferate versions, cluttering up our website, confusing our users, and putting the PMC members through vote after vote.

The second option would be to rework the Pig package so that it has the jars for both, and the pig shell script figures out, based on the Hadoop it finds, which version is being used. This has the nice feature of guaranteeing the same features, but it has a few downsides. One, it bloats our package (since it's carrying multiple jars). Two, what happens when someone wants to add support for a new version (say Hadoop 22) to an existing release? Three, a release manager must now have access to all versions of Hadoop we claim to cover, or wait for help from those who do, in order to test a release.

Hive chose the second option, and dealt with the bloating issue by isolating all the version-specific code in one jar.

We could deal with the concern of adding new versions to an existing release by saying it's not allowed. If you want to add a new supported version, then you create a new version. This will devolve into a situation where versions 0.10 and 0.12 work on 20 and 23, but 0.11 works on 22. That will be horribly confusing for our users.

I think the third issue, testability, is going to mean that certain Pig versions only support certain Hadoop versions, without that being explicitly marked as well. Again, I think this is really bad.

So I vote for the major.minor.patch.hadoopversion solution, though I think we should work hard to make it clear to users how to select the right version of Pig when downloading it.

> (2) Code changes required to support different version of Hadoop. This time around we made an effort to make sure that the same code can work with both. In the future that might not work and we would need to figure out how to maintain different code base. Most likely we would have to have additional branches off of main release branch

Hopefully we can continue to do this via conditional compilation. Having different branches isn't maintainable. How do I push a Hadoop-version-specific patch to the next release? We'll get an ever-growing collection of patches that have to be applied on a Hadoop-specific branch for every release. We need to continue the rule that any patch must apply to the trunk, even when it's version specific.

> (3) Anything else we need to consider?
>
> Olga

Alan.
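To make the major.minor.patch.hadoopversion proposal concrete, here is a small Java sketch that parses strings such as 0.10.1.h23. The exact format (an `h` prefix followed by digits) is taken from the examples in this message; the class `PigReleaseVersion` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for version strings of the form major.minor.patch.hNN. */
final class PigReleaseVersion {
    private static final Pattern FORMAT =
            Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)\\.h(\\d+)");

    final int major;
    final int minor;
    final int patch;
    final int hadoopLine; // e.g. 20 or 23

    private PigReleaseVersion(int major, int minor, int patch, int hadoopLine) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
        this.hadoopLine = hadoopLine;
    }

    /** Parses a string such as "0.10.1.h23"; rejects anything else. */
    static PigReleaseVersion parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a major.minor.patch.hNN version: " + text);
        }
        return new PigReleaseVersion(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)));
    }

    /** True when both builds come from the same Pig release, whatever Hadoop they target. */
    boolean sameReleaseAs(PigReleaseVersion other) {
        return major == other.major && minor == other.minor && patch == other.patch;
    }

    public static void main(String[] args) {
        PigReleaseVersion a = parse("0.10.1.h23");
        PigReleaseVersion b = parse("0.10.1.h20");
        System.out.println(a.sameReleaseAs(b) + " " + a.hadoopLine + " " + b.hadoopLine);
    }
}
```

The `sameReleaseAs` check mirrors the guarantee described above: two builds that differ only in the Hadoop suffix are expected to carry the same functionality.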
-
Re: [DISCUSSION] Pig releases with different versions of Hadoop
Russell Jurney 2011-11-08, 16:17
Option 2 is consistent with 'Pigs eat anything.'
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
-
Re: [DISCUSSION] Pig releases with different versions of Hadoop
Dmitriy Ryaboy 2011-11-08, 18:05
I suspect it might be easier, more maintainable, and still useful to officially support only a couple of versions (and test on both), with "best effort" support for others.

So, for example, the current de facto situation is that we support 0.20.2 (currently the only "officially supported" version), and maybe 0.20.205 (which I am guessing is what Hortonworks devs and customers are mostly running). We can say that we provide "best effort" compatibility for CDH{2,3}.

In the future, I see this shifting to "official" support for 0.20.205 and 0.23, with "best effort" compatibility for 0.22 and CDH{3,4}.

Compile-time switches can control which Hadoop version you build for. Pig should expose some way to programmatically determine which version of Hadoop it was compiled against (and what version of Pig it is).

Ideally, we could rely on BigTop to help ensure a reasonable compatibility level with the "best effort" versions.

I suspect maintaining a separate release for every Hadoop version, given the number of them, is going to be unmaintainable.

D
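One way to expose the compiled-against Hadoop version programmatically is to have the build write a small properties file into the jar and read it back at run time. The Java sketch below assumes that approach; the class `PigBuildInfo` and the keys `pig.version` and `hadoop.compile.version` are hypothetical and not something Pig is known to ship. The properties text is passed in as a string so the example stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical holder for build-time facts recorded by a compile-time switch. */
final class PigBuildInfo {
    // Assumed key names, chosen for this sketch only.
    static final String PIG_VERSION_KEY = "pig.version";
    static final String HADOOP_VERSION_KEY = "hadoop.compile.version";

    private final String pigVersion;
    private final String hadoopCompileVersion;

    private PigBuildInfo(String pigVersion, String hadoopCompileVersion) {
        this.pigVersion = pigVersion;
        this.hadoopCompileVersion = hadoopCompileVersion;
    }

    /** Parses properties-format text; missing keys are reported as "unknown". */
    static PigBuildInfo parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("Unreadable build properties", e);
        }
        return new PigBuildInfo(
                props.getProperty(PIG_VERSION_KEY, "unknown"),
                props.getProperty(HADOOP_VERSION_KEY, "unknown"));
    }

    String pigVersion() {
        return pigVersion;
    }

    String hadoopCompileVersion() {
        return hadoopCompileVersion;
    }

    public static void main(String[] args) {
        PigBuildInfo info = parse("pig.version=0.10.1\nhadoop.compile.version=0.23.0\n");
        System.out.println("Pig " + info.pigVersion()
                + " built against Hadoop " + info.hadoopCompileVersion());
    }
}
```

In a packaged jar the same text would more likely be loaded from a classpath resource; the string-based `parse` is used here only to keep the example runnable on its own.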