OSGi and classloaders
Guillaume Nodet 2012-07-09, 13:24

I'm working with Jean-Baptiste to make Hadoop work in OSGi. OSGi handles classloaders in a very specific way, which leads to several problems with Hadoop.

Let me quickly explain how OSGi works. In OSGi, you deploy bundles, which are jars with additional OSGi metadata. The OSGi framework uses this metadata to create a classloader for the bundle. However, the classloaders are not organized in a tree as in a JEE environment, but rather in a kind of graph, where each classloader has limited visibility and limited exposure. This is controlled at the package level by specifying which packages a given bundle exports and which it imports.

This has two main consequences:
 * OSGi does not support split packages well, where the same package is exported by two different bundles
 * a classloader does not have visibility of everything, as it would in a usual flat classloader environment or even a JEE-like environment

The first problem arises, for example, with the org.apache.hadoop.fs package, which is split across the hadoop-common and hadoop-hdfs jars (the latter defines the Hdfs class). There may be other cases, but I haven't hit them yet. To solve this problem, it would be better if such classes were moved into a different package.

The second problem is much more complicated. I think most of the classloading is done from Configuration. However, Configuration has an internal classloader which the constructor sets to the thread context classloader (defaulting to the Configuration class's classloader), and new Configuration objects are created everywhere in the code. In addition, creating new Configuration objects forces the configuration files to be parsed several times. Also, in OSGi, configuration is better done through the standard OSGi ConfigurationAdmin service, so it would be nice to integrate the configuration into ConfigAdmin when running in OSGi.
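The loader selection described above (thread context classloader first, the defining class's loader as a fallback) can be illustrated with a minimal, self-contained sketch. `LoaderDefaultSketch` and `canLoad` are hypothetical names used only for illustration; this is not Hadoop's Configuration code. The URL-less `URLClassLoader` is an assumption that stands in for a bundle classloader with limited visibility.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical stand-in for the loader selection described above.
// It is not Hadoop's Configuration class and shares no code with it.
final class LoaderDefaultSketch {
    private final ClassLoader classLoader;

    LoaderDefaultSketch() {
        // Capture the thread context classloader at construction time...
        ClassLoader tccl = Thread.currentThread().getContextClassLoader();
        // ...and fall back to the loader that defined this class if none is set.
        this.classLoader = (tccl != null) ? tccl : LoaderDefaultSketch.class.getClassLoader();
    }

    ClassLoader getClassLoader() {
        return classLoader;
    }

    // True if the captured loader can resolve the named class.
    boolean canLoad(String className) {
        try {
            Class.forName(className, false, classLoader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        // A loader with no URLs and only the bootstrap loader as parent,
        // standing in for a bundle classloader with limited visibility.
        ClassLoader restricted = new URLClassLoader(new URL[0], null);
        try {
            current.setContextClassLoader(restricted);
            LoaderDefaultSketch conf = new LoaderDefaultSketch();
            System.out.println("uses context loader: " + (conf.getClassLoader() == restricted));
            System.out.println("sees java.lang.String: " + conf.canLoad("java.lang.String"));
            System.out.println("sees this class: " + conf.canLoad(LoaderDefaultSketch.class.getName()));
        } finally {
            current.setContextClassLoader(original);
        }
    }
}
```

With the restricted loader installed, the object can still resolve `java.lang.String` through the bootstrap loader but cannot see application classes. This is the kind of limited visibility that makes the captured loader matter when objects are created from arbitrary threads.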
For the above reasons, I'd like to know what you would think of transforming the Configuration object into a real singleton, or at least replacing the "new Configuration()" calls spread throughout the code with access to a singleton via Configuration.getInstance(). This would allow the Hadoop OSGi layer to manage the Configuration in a more OSGi-friendly way, allowing the use of a specific subclass which could better manage class loading in an OSGi environment and integrate with ConfigAdmin. This may also remove the need to keep a registry of existing Configuration objects and to update them when, for example, a default resource is added.

Some of the above problems have been addressed in some way in HADOOP-7977, but the fixes I've been working on were more related to the hadoop 1.0.x branch and do not quite apply to trunk.

One last point: the two problems above are mainly due to the fact that I've been assuming that individual Hadoop jars are transformed into native bundles. They would go away if we had a single bundle containing all the individual jars (as was the case with hadoop-core-1.0.x), but having more fine-grained jars is better imho.

Thoughts welcome.

--
Guillaume Nodet
Blog: http://gnodet.blogspot.com/
FuseSource, Integration everywhere
http://fusesource.com
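The accessor proposed in this message could look roughly like the sketch below. It is only an illustration of the idea: Hadoop's Configuration has no `getInstance()` method, `SharedConfigSketch`, `setFactory` and the `example.source` key are hypothetical, and `java.util.Properties` stands in for Configuration.

```java
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical sketch of the proposed accessor. Hadoop's Configuration has no
// getInstance() method; Properties stands in for Configuration here.
final class SharedConfigSketch {
    // Counts how often the (simulated) configuration files are parsed.
    static int loadCount = 0;

    private static Supplier<Properties> factory = SharedConfigSketch::parseDefaults;
    private static Properties instance;

    private SharedConfigSketch() {
    }

    // Simulates the expensive parsing of the default resources.
    private static Properties parseDefaults() {
        loadCount++;
        Properties props = new Properties();
        props.setProperty("example.source", "defaults"); // hypothetical key
        return props;
    }

    // Every caller shares one lazily created instance instead of calling "new".
    static synchronized Properties getInstance() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    // Hook for an OSGi layer to install its own factory, e.g. one backed by ConfigAdmin.
    static synchronized void setFactory(Supplier<Properties> newFactory) {
        factory = newFactory;
        instance = null; // force the next getInstance() to use the new factory
    }

    public static void main(String[] args) {
        Properties first = getInstance();
        Properties second = getInstance();
        System.out.println("same instance: " + (first == second));
        System.out.println("parsed " + loadCount + " time(s)");
    }
}
```

The replaceable factory is what would let an OSGi-specific layer substitute its own source of settings, and the cached instance is what avoids repeated parsing. The sketch deliberately ignores the case of several configurations being in use at the same time.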
-
Re: OSGi and classloaders
Robert Evans 2012-07-09, 14:37

Guillaume,

I am not super familiar with OSGi. I have used it a little in the past, but that was 5+ years ago. I am in favor of something that will fix the CLASSPATH problems that we currently have and would allow for CLASSPATH isolation between Hadoop itself and the applications that use Hadoop. If OSGi can do this cleanly then I am +1 for moving to OSGi.

However, we are trying to maintain binary compatibility within major version numbers, in preparation for rolling upgrades. Many of the things you have suggested, like moving classes from one package to another and doing some serious rework of Configuration, will break not only binary compatibility but also API compatibility.

If we do go this route, just be aware that it is most likely something that would force a major version bump, which right now means trunk (the 3.0 line).

--Bobby Evans
-
Re: OSGi and classloaders
Guillaume Nodet 2012-07-09, 15:04

Right, that would surely be incompatible. The initial work I did was on 1.0.3, and those problems can be solved in a simpler (though less clean) way in that branch, mainly because there is a single jar which contains everything, which causes fewer problems in OSGi.

For trunk, is there any valid reason to create multiple configurations? Or is the idea of a singleton something that I can investigate working on? I'm not very familiar with Hadoop internals, so I may very well be missing some edge cases. If not, I can come up with a patch that would transform Configuration into a singleton, leading to more flexibility for OSGi and a performance improvement by avoiding re-parsing the XML configuration multiple times.

--
Guillaume Nodet
Blog: http://gnodet.blogspot.com/
FuseSource, Integration everywhere
http://fusesource.com
-
Re: OSGi and classloaders
Robert Evans 2012-07-09, 18:13

Guillaume,

The problem with Configuration is that it is public, so changing it does not just impact Hadoop. It also impacts all of the projects that use it, either directly as part of the Map/Reduce APIs or for storing their own configuration.

Within Hadoop proper there are several places where it cannot just be static. For Map/Reduce, a Configuration object is created for each Map/Reduce job. So a client may have multiple different instances of Configuration in flight at any point in time, one for each job. HDFS also supports having multiple separate configurations in the client simultaneously. For some processes, like the NameNode, DataNode and ResourceManager, you may be able to get away with a single static configuration, but from the client's perspective that may be difficult. I am not really sure about the NodeManager, because it interacts with HDFS on behalf of the end user and I am not completely sure how Configuration fits into that picture.

--Bobby Evans
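Bobby's point that a client can have one configuration per job in flight can be shown with a small sketch. The types here are hypothetical: `java.util.Properties` stands in for Configuration, `JobSketch` for a submitted job, and `example.reduces` is an invented key. Copying the base settings into each job is a choice made for the sketch, not a statement about how Hadoop builds job configurations.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's Configuration and
// JobSketch for a submitted job. Neither type is taken from Hadoop.
final class PerJobConfigSketch {
    static final class JobSketch {
        final String name;
        final Properties conf = new Properties();

        JobSketch(String name, Properties base) {
            this.name = name;
            // Each job starts from the shared defaults but owns its settings.
            // Copying is a choice made for this sketch only.
            this.conf.putAll(base);
        }
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("example.reduces", "1"); // hypothetical key

        // One client, two jobs in flight, each with its own configuration.
        JobSketch sortJob = new JobSketch("sort", base);
        JobSketch joinJob = new JobSketch("join", base);
        sortJob.conf.setProperty("example.reduces", "8");
        joinJob.conf.setProperty("example.reduces", "32");

        System.out.println(sortJob.name + " -> " + sortJob.conf.getProperty("example.reduces"));
        System.out.println(joinJob.name + " -> " + joinJob.conf.getProperty("example.reduces"));
        System.out.println("base -> " + base.getProperty("example.reduces"));
    }
}
```

If both jobs read from a single process-wide instance, the second `setProperty` call would overwrite the first job's setting, which is why a plain singleton does not fit the client side.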
-
Re: OSGi and classloaders
Owen O'Malley 2012-07-09, 18:30

Changing the configurations is a big and very touchy job. It is touchy in that it is very exposed to the users, and many, many applications assume the configuration is dealt with in particular ways. Maintaining compatibility is a requirement, and that needs to be factored into the work. Furthermore, you have not only the local configuration, but also the way that configuration is handled between the clients, servers, and MapReduce tasks. Even little changes in the past (e.g. making a copy of a configuration at one spot) have broken both users and frameworks built on top (e.g. Pig, Hive, Oozie).

As Bobby said, Configuration is absolutely not a singleton. Many of the servers (JobTracker, Oozie, etc.) use configurations to keep track of the different contexts for each user. You could go to a dependency injection approach based on Guice to make it pluggable and yet context sensitive.

-- Owen
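The dependency injection idea Owen mentions can be sketched without any framework. The example below uses plain constructor injection in place of Guice, so it only illustrates the shape of the approach. `ConfigProvider`, `PerUserProvider`, `ServerSketch` and the `example.user` key are hypothetical, and `java.util.Properties` again stands in for Configuration.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch using plain constructor injection instead of Guice.
final class InjectedConfigSketch {
    // Pluggable lookup: context (here a user name) -> configuration.
    interface ConfigProvider extends Function<String, Properties> {
    }

    // Default provider: one lazily created configuration per user context.
    static final class PerUserProvider implements ConfigProvider {
        private final Map<String, Properties> byUser = new ConcurrentHashMap<>();

        @Override
        public Properties apply(String user) {
            return byUser.computeIfAbsent(user, u -> {
                Properties props = new Properties();
                props.setProperty("example.user", u); // hypothetical key
                return props;
            });
        }
    }

    // The component receives the provider instead of creating configurations itself.
    static final class ServerSketch {
        private final ConfigProvider provider;

        ServerSketch(ConfigProvider provider) {
            this.provider = provider;
        }

        String userSetting(String user) {
            return provider.apply(user).getProperty("example.user");
        }
    }

    public static void main(String[] args) {
        ServerSketch server = new ServerSketch(new PerUserProvider());
        System.out.println(server.userSetting("alice"));
        System.out.println(server.userSetting("bob"));
    }
}
```

Because the component only sees the provider interface, an OSGi-specific provider could be substituted without changing the component, while each user context still gets its own configuration.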
-
Re: OSGi and classloaders
Jean-Baptiste Onofré 2012-07-09, 18:44

Thanks for the update guys.

We are going to look for a way to handle configurations in an OSGi way without changing the API/dependency.

Regards
JB

--
Jean-Baptiste Onofré
[EMAIL PROTECTED]
http://blog.nanthrax.net
Talend - http://www.talend.com
-
Re: OSGi and classloaders
Jean-Baptiste Onofré 2012-07-09, 14:50

Hi Bobby,

Guillaume and I are working on trunk, so it makes sense to focus on trunk for this kind of refactoring. We are working on a fork branch on GitHub. We can choose when to merge our changes to trunk (or a dedicated branch).

Regards
JB

--
Jean-Baptiste Onofré
[EMAIL PROTECTED]
http://blog.nanthrax.net
Talend - http://www.talend.com