-
Too large class path for map reduce jobs
Henning Blohm 2010-09-17, 11:56
When running map reduce tasks in Hadoop I run into classpath issues. Contrary to previous posts, my problem is not that I am missing classes on the Task's class path (we have a perfect solution for that) but rather find too many (e.g. ECJ classes or jetty).
The libs in HADOOP_HOME/lib seem to contain everything needed to run anything in Hadoop which is, I assume, much more than is needed to run a map reduce task.
Is there a practical way to take just the libs needed for map reduce jobs and have the class path for m/r tasks point to only those?
That is: what would that set of libs comprise, and where would I specify the class path for m/r tasks?
Thanks, Henning
-
Re: Too large class path for map reduce jobs
Allen Wittenauer 2010-09-17, 16:01
On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
> When running map reduce tasks in Hadoop I run into classpath issues. Contrary to previous posts, my problem is not that I am missing classes on the Task's class path (we have a perfect solution for that) but rather find too many (e.g. ECJ classes or jetty).
The fact that you mention:
> The libs in HADOOP_HOME/lib seem to contain everything needed to run anything in Hadoop which is, I assume, much more than is needed to run a map reduce task.
hints that your perfect solution is to throw all your custom stuff in lib. If so, that's a huge mistake. Use the distributed cache instead.
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-09-17, 18:53
Not really. "Anything in Hadoop" was really meant to say just that.
The way we want to run tasks is with integrated provisioning of everything needed (using www.z2-environment.eu). So effectively a Hadoop task loads the provisioning capability in-process and then runs the actual task implementation as provisioned (from another repository, actually), so that we do not need a special build process for Hadoop jobs.
However, everything on the class path of the Hadoop task is visible to the code of the z2 system and the task implementation and may lead to conflicts with other code. Specifically, the Java compiler implementation that is on the Hadoop class path (due to the use of Jasper) conflicts with the one we use. That's why we would like to run Hadoop tasks without unnecessary stuff (e.g. Jasper) on the class path.
Thanks, Henning
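The idea of running tasks without unnecessary libraries can be sketched as a simple filter over a classpath string. This is a minimal illustration only, not Hadoop code: the excluded name fragments, the function name and the example paths are assumptions, and the thread does not say where such a filter would hook into Hadoop.

```python
import os

# Hypothetical name fragments of JARs a map/reduce task is assumed not to need.
EXCLUDED_FRAGMENTS = ("jetty", "jasper", "jsp-api")


def filter_classpath(classpath, excluded=EXCLUDED_FRAGMENTS, sep=os.pathsep):
    """Drop classpath entries whose file name contains an excluded fragment."""
    kept = []
    for entry in classpath.split(sep):
        if not entry:
            continue  # skip empty segments such as a trailing separator
        name = os.path.basename(entry.rstrip("/"))
        if any(fragment in name for fragment in excluded):
            continue  # entry matches the exclusion list, leave it out
        kept.append(entry)
    return sep.join(kept)


if __name__ == "__main__":
    # Example paths are made up for illustration.
    full = "/opt/hadoop/hadoop-core.jar:/opt/hadoop/lib/jetty-6.1.14.jar:/opt/hadoop/conf"
    print(filter_classpath(full, sep=":"))
```

Matching on file names is crude; a real exclusion list would have to be maintained against the JARs actually shipped in HADOOP_HOME/lib.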
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-09-24, 10:41
Short update on the issue:
I tried to find a way to separate class path configurations by modifying the scripts in HADOOP_HOME/bin, but found that TaskRunner actually copies the class path setting from the parent process when starting a local task, so I do not see a way of having less on a job's classpath without modifying Hadoop.
As that will present a real issue when running our jobs on Hadoop, I would like to propose changing TaskRunner so that it sets a class path specifically for M/R tasks. That class path could be defined in the scripts (as for the other processes) using a particular environment variable (e.g. HADOOP_JOB_CLASSPATH). It could default to the current VM's class path, preserving today's behavior.
Is it ok to enter this as an issue?
Thanks, Henning
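The proposed fallback behavior can be sketched as follows. This is a minimal model under stated assumptions, not TaskRunner's actual code: HADOOP_JOB_CLASSPATH is the variable name proposed in the message, while the function names and the main class `example.TaskMain` are hypothetical.

```python
def resolve_task_classpath(env, parent_classpath):
    """Choose the classpath for a child task JVM.

    If HADOOP_JOB_CLASSPATH is set and non-empty it is used; otherwise the
    parent process's classpath is kept, which preserves today's behavior.
    """
    override = env.get("HADOOP_JOB_CLASSPATH", "").strip()
    return override if override else parent_classpath


def build_child_command(env, parent_classpath, main_class):
    """Assemble the java command line for the child task (illustrative only)."""
    classpath = resolve_task_classpath(env, parent_classpath)
    return ["java", "-classpath", classpath, main_class]


if __name__ == "__main__":
    import os
    # "example.TaskMain" is a placeholder, not a real Hadoop class.
    print(build_child_command(dict(os.environ), "hadoop-core.jar", "example.TaskMain"))
```

Treating an empty variable the same as an unset one is a design choice made here so that a blank setting in a script cannot start a task with no classpath at all.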
-
Re: Too large class path for map reduce jobs
Tom White 2010-10-05, 22:59
Hi Henning,

I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938 and https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have discussion about this issue.

Cheers
Tom
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-10-06, 09:57
Hi Tom,

that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here.

Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a "better" order of stuff on the class path will not work when people actually use child loaders (as the parent wins), like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part etc. etc.... no good.

Child loaders are good for module separation but should not be used to "hide" type visibility from the parent. That almost certainly leads to a class loader constraint violation once you lose control (which is usually earlier than expected).

The suggestion to reduce the job class path to the required minimum is the most practical approach. There is some gray area there of course, and it will not be feasible to reach the absolute minimal set of types, but something reasonable, i.e. the Hadoop core that suffices to run the job. Certainly jetty & co are not required for job execution (btw. I "hacked" 0.20.2 to remove anything in "server/" from the classpath before setting the job class path).

I would suggest

a) introducing some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the "core" classpath (as described above). If not set, for compatibility, preserve today's behavior.
b) not getting into custom child loaders for jobs as part of Hadoop M/R. It's non-trivial to get it right and feels to be beyond scope.

I wouldn't mind helping btw.

Thanks,
Henning
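The point that "the parent wins" under default delegation can be illustrated with a toy model. This is a simplified simulation in Python, not Java's ClassLoader: the `Loader` class, the loader names and the class names are all made up for illustration.

```python
class Loader:
    """Toy model of a class loader that delegates to its parent first."""

    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = dict(classes)  # class name -> version label
        self.parent = parent

    def load(self, class_name):
        # Parent-first: ask the parent before looking at our own entries.
        if self.parent is not None:
            try:
                return self.parent.load(class_name)
            except LookupError:
                pass  # parent does not know the class, fall through
        if class_name in self.classes:
            return (self.name, self.classes[class_name])
        raise LookupError(class_name)


if __name__ == "__main__":
    # The task classpath carries a compiler class; the module ships its own.
    task = Loader("task-classpath", {"example.Compiler": "bundled"})
    module = Loader(
        "module",
        {"example.Compiler": "module-version", "example.Job": "1.0"},
        parent=task,
    )
    print(module.load("example.Compiler"))  # resolved by the parent
    print(module.load("example.Job"))       # resolved by the module itself
```

In this model no reordering inside the module's own entries changes which `example.Compiler` is returned; only removing it from the parent does, which is the argument for trimming the job class path.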
-
Re: Too large class path for map reduce jobs
Alejandro Abdelnur 2010-10-06, 10:28
1. Classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps.

2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT, to create the JAR-names-only classpath before starting the child.

Alejandro
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-10-06, 11:57
Hi Alejandro,

yes, it can of course be done right (sorry if my wording seemed to imply otherwise). Just saying that I think that Hadoop M/R should not go into that class loader / module separation business. It's one Job, one VM, right? So the problem is to assign just the stuff needed to let the Job do its business without becoming an obstacle.

Must admit I didn't understand your proposal 2. How would that remove (e.g.) jetty libs from the job's classpath?

Thanks,
Henning
-
Re: Too large class path for map reduce jobs
Alejandro Abdelnur 2010-10-07, 05:02
Fragmentation of Hadoop classpaths is another issue: Hadoop should split the CP into three:

1* client CP: what is needed to submit a job (only the nachos)
2* server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole enchilada)
3* job CP: what is needed to run a job (some of the enchilada)

But I'm not trying to get into that here. What I'm suggesting is:

-----
# Hadoop JARs:

/Users/tucu/dev-apps/hadoop/conf
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar
/Users/tucu/dev-apps/hadoop/bin/..
/Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar
/Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar

..... (about 30 jars from hadoop lib/ )

/Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar

# Job JARs (for a job with only 2 JARs):

/Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/0000003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar
/Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
/Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_000000_0/work
-----

What I'm suggesting is that the latter group, the job JARs, be soft-linked (by the TT) into the working directory, so that their classpath is just:

-----
java-launcher.jar
oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
.
-----

Alejandro
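The soft-linking step can be sketched roughly as below. This is an assumption-laden illustration, not TaskTracker code: the function name is hypothetical, a POSIX file system with symlink support is assumed, and two JARs with the same file name are not handled (the first link wins).

```python
import os


def link_job_jars(jar_paths, work_dir):
    """Soft-link each job JAR into work_dir and return a names-only classpath.

    The returned classpath lists bare JAR names followed by '.', so it is
    only valid when the JVM is started with work_dir as its current directory.
    """
    names = []
    for jar in jar_paths:
        name = os.path.basename(jar)
        link = os.path.join(work_dir, name)
        if not os.path.lexists(link):
            # Link to the absolute path so the link works from any directory.
            os.symlink(os.path.abspath(jar), link)
        names.append(name)
    names.append(".")
    return os.pathsep.join(names)
```

This shortens the classpath string only; the set of classes visible to the task stays exactly the same.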
-
Re: Too large class path for map reduce jobs
Alejandro Abdelnur 2010-10-07, 05:22
[sent too soon]
The first CP shown is what the CP of a task looks like today. If we change it to pick up all the job JARs from the current dir, then the classpath will be much shorter (second CP shown). We can easily achieve this by soft-linking the job JARs into the work dir of the task.
Alejandro
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-10-07, 07:43
So that's actually another issue, right? Besides splitting the classpath into those three groups, you want the TT to create soft-links on demand to simplify the computation of the classpath string. Is that right?
But it's the TT that actually starts the job VM. Why does it matter what the string actually looks like, as long as it has the right content?
Thanks, Henning
-
Re: Too large class path for map reduce jobs
Alejandro Abdelnur 2010-10-07, 08:23
Well, if the issue is a classpath that is too long, the softlink thingy will give some room to breathe, as the total CP length will be much smaller.
A
On Thu, Oct 7, 2010 at 3:43 PM, Henning Blohm <[EMAIL PROTECTED]>wrote:
> So that's actually another issue, right? Besides splitting the classpath
> into those three groups, you want the TT to create soft-links on demand to
> simplify the computation of classpath string. Is that right?
>
> But it's the TT that actually starts the job VM. Why does it matter what
> the string actually looks like, as long as it has the right content?
>
> Thanks,
> Henning
>
> On Thu, 2010-10-07 at 13:22 +0800, Alejandro Abdelnur wrote:
>
> [sent too soon]
>
> The first CP shown is how the CP of a task looks today. If we change it to
> pick up all the job JARs from the current dir, then the classpath will be
> much shorter (second CP shown). We can easily achieve this by soft-linking
> the job JARs in the work dir of the task.
>
> Alejandro
>
> On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur <[EMAIL PROTECTED]>
> wrote:
>
> Fragmentation of Hadoop classpaths is another issue: Hadoop should
> differentiate the CP in 3:
>
> 1*client CP: what is needed to submit a job (only the nachos)
> 2*server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole
> enchilada)
> 3*job CP: what is needed to run a job (some of the enchilada)
>
> But I'm not trying to get into that here. What I'm suggesting is:
>
> -----
> # Hadoop JARs:
>
> /Users/tucu/dev-apps/hadoop/conf
> /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar
> /Users/tucu/dev-apps/hadoop/bin/..
> /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar
> /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar
>
> ..... (about 30 jars from hadoop lib/ )
>
> /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar
>
> # Job JARs (for a job with only 2 JARs):
>
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/0000003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
> /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_000000_0/work
> -----
>
> What I'm suggesting is that the latter group, the job JARs, be
> soft-linked (by the TT) into the working directory; then their classpath is
> just:
>
> -----
> java-launcher.jar
> oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar
> .
> -----
>
> Alejandro
>
> On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm <[EMAIL PROTECTED]>
> wrote:
>
> Hi Alejandro,
>
> yes, it can of course be done right (sorry if my wording seemed to imply
> otherwise). Just saying that I think that Hadoop M/R should not go into that
> class loader / module separation business. It's one Job, one VM, right? So
> the problem is to assign just the stuff needed to let the Job do its
> business without becoming an obstacle.
>
> Must admit I didn't understand your proposal 2. How would that remove
> (e.g.) jetty libs from the job's classpath?
>
> Thanks,
> Henning
>
> On Wednesday, 06.10.2010, 18:28 +0800, Alejandro Abdelnur wrote:
>
> 1. Classloader business can be done right. Actually it could be done as
> spec-ed for servlet web-apps.
>
> 2. If the issue is strictly 'too large classpath', then a simpler solution
> would be to soft-link all JARs to the current directory and create the
> classpath with the JAR names only (no path). Note that the soft-linking
> business is already supported by the DistributedCache. So the changes would
> be mostly in the TT to create the JAR names only classpath before starting
-
Re: Too large class path for map reduce jobs
Tom White 2010-10-07, 20:27
I wonder if there is a misunderstanding here - the problem is that the classpath has too many classes on it (and clashes with user classes), rather than it being a text string which is too long.
I would suggest that the technical discussion of how to fix this goes onto the JIRA.
Cheers, Tom
On Thu, Oct 7, 2010 at 1:23 AM, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> well, if the issue is a too long classpath, the softlink thingy will give
> some room to breathe as the total CP length will be much smaller.
>
> A
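To make the distinction concrete: a clash of the kind described in this message shows up when the same class file is shipped by more than one JAR on the classpath. The Python sketch below lists such duplicates. The names `find_class_clashes` and `make_jar`, and the JAR and class names in the demo, are hypothetical; the JARs are stand-ins built for the demo, not real Hadoop libraries.

```python
import os
import tempfile
import zipfile
from collections import defaultdict


def find_class_clashes(classpath_jars):
    """Return {class entry: [JAR names]} for classes found in 2+ JARs."""
    providers = defaultdict(list)
    for jar in classpath_jars:
        jar_name = os.path.basename(jar)
        with zipfile.ZipFile(jar) as archive:
            for entry in archive.namelist():
                if entry.endswith(".class") and jar_name not in providers[entry]:
                    providers[entry].append(jar_name)
    return {cls: jars for cls, jars in providers.items() if len(jars) > 1}


def make_jar(directory, name, class_entries):
    """Write a stand-in JAR (a zip with empty .class entries)."""
    path = os.path.join(directory, name)
    with zipfile.ZipFile(path, "w") as archive:
        for entry in class_entries:
            archive.writestr(entry, b"")
    return path


def demo():
    """Build one 'framework' JAR and one 'user' JAR that overlap."""
    tmp = tempfile.mkdtemp()
    framework = make_jar(tmp, "framework-lib.jar",
                         ["org/example/http/Server.class",
                          "org/example/util/Log.class"])
    user = make_jar(tmp, "user-job.jar",
                    ["org/example/http/Server.class",
                     "com/acme/job/MyMapper.class"])
    return find_class_clashes([framework, user])


if __name__ == "__main__":
    for cls, jars in demo().items():
        print(cls, "is provided by", ", ".join(jars))
```

A check like this reports overlap regardless of how long the classpath string is, which is why shortening the string does not help with this problem.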
-
Re: Too large class path for map reduce jobs
Henning Blohm 2010-10-08, 07:52
Ahh... that could indeed be the case. Yes, my issue was about "large" rather than "long".
Thanks for clarifying!
Henning
On Thu, 2010-10-07 at 13:27 -0700, Tom White wrote:
> I wonder if there is a misunderstanding here - the problem is that the
> classpath has too many classes on it (and clashes with user classes),
> rather than it being a text string which is too long.
>
> I would suggest that the technical discussion of how to fix this goes
> onto the JIRA.
>
> Cheers,
> Tom