-
Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-05-26, 14:45

Hi, everybody.

I'm running into some difficulties getting needed libraries to map/reduce tasks using the distributed cache.

I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement by the client, so more current versions are not really viable options.

The code I've inherited is Java, which sets up and runs the MR job. There's currently some nontrivial pre- and post-processing, so it will be a large refactoring before I can just run bare MR jobs rather than starting them through Java.

Further complicating matters: in practice the Java jobs are launched by Oozie, which of course does so by wrapping each one in a MR shell. The upshot is that I don't have any control over which "local" filesystem the Java job is run from, though if local files are absolutely needed I can make my Java wrappers copy stuff back from HDFS to the Java job's local filesystem.

So here's the problem:

Mappers and/or reducers need class Needed, which is contained in needed-1.0.jar, which is in HDFS:
    hdfs://.../libdir/distributed/needed-1.0.jar

The Java program executes:
    DistributedCache.addFileToClassPath(new Path("hdfs://.../libdir/distributed/needed-1.0.jar"), job.getConfiguration());

Inspecting the Job object, I find the file has been added to the cache files as expected:
    job.conf.overlay[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
    job.conf.properties[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar

And the class seems to show up in the internal ClassLoader:
    job.conf.classLoader.classes[...] = "class my.class.package.Needed"

though this may just be inherited from the ClassLoader of the Java process itself (which also uses Needed).

And yet as soon as I get into the mapreduce job itself I start getting:

    2011-05-25 17:22:56,080 INFO JobClient - Task Id : attempt_201105251330_0037_r_000043_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed

Up until this point we've run things by having a directory on each node containing all the libraries we'd need, and including that in the Hadoop classpath, but we have no such control in the deployment scenario, so we have to make our program hand the needed libraries to the map and reduce nodes via the distributed cache classpath.

Thanks in advance for any insight or assistance you can offer.
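
The caveat above, that seeing Needed in the client-side ClassLoader says nothing about the task JVMs, can be illustrated with plain Java. The following is a minimal sketch using only the JDK, not Hadoop: the class name ClassVisibilityDemo and the "isolated" loader are hypothetical stand-ins, the latter playing the role of a task JVM whose classpath lacks the JAR.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Sketch: a class visible to one class loader can be invisible to another.
 * The isolated loader is a stand-in for a task JVM without the needed JAR.
 */
final class ClassVisibilityDemo {

    /** Returns true if the given loader can load the named class. */
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            // initialize=false: we only care whether the class can be found
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String name = ClassVisibilityDemo.class.getName();
        // The loader that loaded this program (the "client" side)
        ClassLoader clientLoader = ClassVisibilityDemo.class.getClassLoader();
        // No URLs and no application parent: only JDK classes are reachable
        try (URLClassLoader isolated = new URLClassLoader(new URL[0], null)) {
            System.out.println("client loader sees " + name + ": "
                    + canLoad(clientLoader, name));
            System.out.println("isolated loader sees " + name + ": "
                    + canLoad(isolated, name));
        }
    }
}
```

The point of the sketch is only that class visibility is per loader (and per JVM), so the ClassNotFoundException has to be diagnosed on the task side rather than from the submitting process.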
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
vishnu krishnan, 2011-05-26, 15:46

I am new to MapReduce. One thing I have to know: can I use a MapReduce program without any file system?
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Robert Evans, 2011-05-26, 15:59

Vishnu,

You have to have a file system that is accessible from all nodes involved to run Hadoop Map Reduce. This could be NFS if it is a small number of nodes, or even the local file system if you are just running one node. But, with that said, Hadoop is designed to process big data (GB, TB, and even PB), so HDFS or some other distributed file system is best if that is what you are doing. You can use it simply to distribute a computing job to several different machines, but Hadoop Map Reduce still needs a file system as part of the distribution mechanism.

--Bobby Evans
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
vishnu krishnan, 2011-05-26, 17:23

Thank you.

So I just want to take a GB of data, give it to map/reduce, and then store it in the database?

--
Vishnu R Krishnan
Software Engineer
Create @ Amrita
Amritapuri
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Robert Evans, 2011-05-26, 17:34

If it is just a GB then you probably don't need Hadoop, unless there is some serious processing involved that hasn't been explained or you already have the data on HDFS, or you happen to have a Hadoop cluster that you have access to and the amount of data is going to grow in size. Then it could be worth it to write a M/R job to load the data into a DB.

--Bobby
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
vishnu krishnan, 2011-05-26, 17:47

Thanks.

If I am not using map/reduce here, and I just send that data directly to the DB, what will the problems be?
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-05-26, 17:50

On Thu, 26 May 2011 23:17:43 +0530, vishnu krishnan <[EMAIL PROTECTED]> wrote:
> Thanks.
>
> If I am not using map/reduce here, and I just send that data directly
> to the DB, what will the problems be?

Look, I hate to be That Guy, especially on my first day on the list, but would you mind moving to your own thread and not hijacking mine?

Thanks.
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
vishnu krishnan, 2011-05-26, 17:54

Sorry, I forgot that. I am moving to a new thread.

Thanks.
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Alejandro Abdelnur, 2011-05-27, 22:47

John,

If you are using Oozie, dropping all the JARs your MR jobs need in the Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs are in the distributed cache.

Alejandro
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-05-30, 13:34

On Fri, 27 May 2011 15:47:23 -0700, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> John,
>
> If you are using Oozie, dropping all the JARs your MR jobs need in the
> Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs
> are in the distributed cache.

That doesn't seem to work. I have this JAR in the WF /lib/ directory because the Java job that launches the MR job needs it. And yes, it's in the distributed cache for the wrapper MR job that Oozie uses to remotely run the Java job. The problem is it's not available for the MR job that the Java job launches.

Thanks, though, for the suggestion.
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Alejandro Abdelnur, 2011-05-30, 16:43

John,

Now I get what you are trying to do.

My recommendation would be:

* Use a Java action to do all the stuff prior to starting your MR job
* Use a mapreduce action to start your MR job
* If you need to propagate properties from the Java action to the MR action you can use the <capture-output> flag.

If you still want to start your MR job from your Java action, then your Java action should do all the setup the MapReduceMain class does before starting the MR job (this will ensure delegation tokens and the distributed cache are available to your MR job).

Thanks.

Alejandro
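
As an illustration of the third bullet, here is a minimal sketch of how a Java action's main class can hand properties back to Oozie when <capture-output> is set. It assumes the documented Oozie convention that the action writes a java.util.Properties file to the path given in the system property oozie.action.output.properties. The class name CaptureOutputSketch, the key preprocessedDir and its value are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

/** Sketch of a Java action main class that publishes properties to Oozie. */
final class CaptureOutputSketch {

    /** Writes the properties to outputFile in java.util.Properties format. */
    static void writeOutput(Properties props, File outputFile) throws IOException {
        try (OutputStream out = new FileOutputStream(outputFile)) {
            props.store(out, null);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Oozie sets this system property for Java actions
        // that declare <capture-output/> in the workflow definition.
        String location = System.getProperty("oozie.action.output.properties");
        if (location == null) {
            System.err.println("Not running under an Oozie Java action; nothing written.");
            return;
        }
        Properties props = new Properties();
        // Hypothetical key and value, for illustration only
        props.setProperty("preprocessedDir", "/hypothetical/output/dir");
        writeOutput(props, new File(location));
    }
}
```

A later action in the workflow could then read the value with an EL expression of the form ${wf:actionData('java-node')['preprocessedDir']}, where java-node is whatever name the Java action has in the workflow.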
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-05-30, 17:22

On Mon, 30 May 2011 09:43:14 -0700, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> If you still want to start your MR job from your Java action, then your
> Java action should do all the setup the MapReduceMain class does before
> starting the MR job (this will ensure delegation tokens and distributed
> cache is avail to your MR job).

Yes, my Java action is doing the setup work. In particular, it calls DistributedCache.addFileToClassPath(), which (according to the documentation) should be the same as passing it in at the command line with -libjars, right? And yet it doesn't seem to work.

Is this the same as the JIRA issue MAPREDUCE-752? And if so, does this mean that there is no solution (other than a workaround like passing a fat JAR) that doesn't involve patching the Hadoop code itself (which I'd have to get our client to agree to)?
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Alejandro Abdelnur, 2011-05-31, 19:02

What exactly is it that does not work?

Oozie uses DistributedCache as the only mechanism to set classpaths for jobs, and it works fine.

Thanks.

Alejandro
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-05-31, 19:09

On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> What exactly is it that does not work?

Oozie launches a wrapper MapReduce job to run a Java job J1. Oozie's /lib/ directory is provided to the classpath of J1 as expected. This part works.

The Java job J1 configures and launches a MapReduce job MR1. As part of the configuration, J1 tries to put some JARs on the distributed classpath for MR1 to use in its mappers and reducers. To do so, it calls DistributedCache.addFileToClassPath(jarfilePath).

The file at jarfilePath DOES get added to the distributed cache. But the mapper for MR1 still throws a ClassNotFoundException, since the file at jarfilePath is NOT on the classpath for MR1. This is what doesn't work.

I hope this explanation makes more sense. Thanks again for putting some thought to it.
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-06-01, 19:38

On Tue, 31 May 2011 15:09:28 -0400, John Armstrong <[EMAIL PROTECTED]> wrote:
> On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
>> What exactly is it that does not work?

In the hopes that more information can help, I've dug into the local filesystems on each of my four nodes and retrieved the job.xml and the locations of the files to show that everything shows up where it should.

In this example I have one regular file (hdfs://node1:hdfsport/hdfs/path/to/file1.foo) added with DistributedCache.addCacheFile(). I also have a JAR (hdfs://node1:hdfsport/hdfs/path/to/needed.jar) added with DistributedCache.addFileToClassPath(). The needed JAR is also part of the classpath Oozie provides to my Java task.

As you can see, both files (with correct filesizes and timestamps) are listed as cache files in job.xml, and the JAR is listed as a classpath file. Both files show up on each node; the JAR shows up twice on node 1 since that's where Oozie ran the Java task, and thus where Oozie placed the JAR with its own use of the distributed cache.

And yet, when mapreduce actually tries to run the job my Java task launches, it immediately hits a ClassNotFoundException, claiming it can't find the class my.class.package.Needed which is contained in needed.jar.

JOB.XML

...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.job.classpath.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/file1.foo,hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.filesizes</name>
  <value>61175,2257057</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.timestamps</name>
  <value>1306949104866,1306949371660</value>
</property>
...

NODE 1 LOCAL FILESYSTEM

/data/4/mapred/local/taskTracker/distcache/5181540010607464671_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/1/mapred/local/taskTracker/distcache/6423795395825083633_-1942178119_1279314284/node1/hdfs/path/to/needed.jar
/data/3/mapred/local/taskTracker/distcache/2424191142954514770_1281905983_1269665052/node1/hdfs/path/to/needed.jar

NODE 2 LOCAL FILESYSTEM

/data/1/mapred/local/taskTracker/distcache/-1458632814086969626_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/4434671176913378591_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 3 LOCAL FILESYSTEM

/data/1/mapred/local/taskTracker/distcache/-6763452370915390695_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/6838381597046551111_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 4 LOCAL FILESYSTEM

/data/1/mapred/local/taskTracker/distcache/-1759547009148985681_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/1998811135309473771_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

SAMPLE MAPPER ATTEMPT LOG

2011-06-01 14:21:41,442 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-06-01 14:21:41,557 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/job.jar <- /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_000009_0/work/./job.jar
2011-06-01 14:21:41,560 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/.job.jar.crc <- /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_000009_0/work/./.job.jar.crc
2011-06-01 14:21:41,563 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-06-01 14:21:41,660 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:973)
    at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:236)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:484)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:298)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.ClassNotFoundException: my.class.package.Needed
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:920)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:971)
    ... 8 more
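
The manual check described in this message (the classpath JAR must also appear among the cache files) can be automated when comparing job.xml files across runs. The following is a minimal sketch using only the JDK's XML parser. The class name JobXmlCheck is hypothetical, the sample is just the fragment quoted above wrapped in a <configuration> root, and it assumes a single entry in mapred.job.classpath.files, as in this post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch: check that the classpath JAR in a job.xml is also a cache file. */
final class JobXmlCheck {

    /** Fragment of the job.xml quoted in the post, wrapped in a root element. */
    static final String SAMPLE =
        "<configuration>"
        + "<property><name>mapred.job.classpath.files</name>"
        + "<value>hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value></property>"
        + "<property><name>mapred.cache.files</name>"
        + "<value>hdfs://node1:hdfsport/hdfs/path/to/file1.foo,"
        + "hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value></property>"
        + "</configuration>";

    /** Reads every property element into a name -> value map. */
    static Map<String, String> readProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    /** True if the (single) classpath entry is one of the comma-separated cache files. */
    static boolean classpathJarIsCached(Map<String, String> props) {
        String jar = props.get("mapred.job.classpath.files");
        String cached = props.get("mapred.cache.files");
        if (jar == null || cached == null) {
            return false;
        }
        return Arrays.asList(cached.split(",")).contains(jar);
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = readProperties(SAMPLE);
        System.out.println("classpath JAR is a cache file: " + classpathJarIsCached(props));
    }
}
```

A check like this only confirms what the post already shows by inspection: the configuration looks consistent, so the failure has to come from how the task's classpath is assembled rather than from a missing cache entry.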
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
Alejandro Abdelnur, 2011-06-01, 19:48

John,

Do you have all JARs used by your classes in Needed.jar in the DC classpath as well?

Are you propagating the delegation token?

Thanks.

Alejandro
-
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John Armstrong, 2011-06-01, 20:06

On Wed, 1 Jun 2011 12:48:51 -0700, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> Do you have all JARs used by your classes in Needed.jar in the DC classpath
> as well?

needed.jar contains the class Needed, which my mappers need. If the class Needed calls for another class AlsoNeeded in another jar, wouldn't I get a ClassNotFoundException for AlsoNeeded?

> Are you propagating the delegation token?

Now we're getting somewhere: I don't have any idea what you mean by this. If this is something I need to be doing to get this technique to work, I'd love to see a reference teaching me how to do it.

Thanks again.
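
For readers with the same question, here is a rough sketch of what token propagation from an Oozie Java action usually looks like on security-enabled Hadoop builds. It rests on assumptions the thread does not confirm for this cluster: that the launcher exports the token file location in the HADOOP_TOKEN_FILE_LOCATION environment variable, and that the job reads it from the mapreduce.job.credentials.binary property. To stay self-contained, a plain Map stands in for the Hadoop job configuration; the class name TokenPropagationSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: copy the delegation token file location into the job settings. */
final class TokenPropagationSketch {

    // Assumed names, taken from security-enabled Hadoop conventions
    static final String TOKEN_ENV = "HADOOP_TOKEN_FILE_LOCATION";
    static final String TOKEN_PROP = "mapreduce.job.credentials.binary";

    /**
     * If the environment carries a token file location, record it in the
     * job settings. The jobConf map is a stand-in for the real job
     * configuration object, which would be updated the same way.
     */
    static void propagateToken(Map<String, String> env, Map<String, String> jobConf) {
        String tokenFile = env.get(TOKEN_ENV);
        if (tokenFile != null) {
            jobConf.put(TOKEN_PROP, tokenFile);
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        propagateToken(System.getenv(), jobConf);
        System.out.println("job settings after propagation: " + jobConf);
    }
}
```

On a build without security the environment variable is simply absent and nothing is set, so this step would not by itself explain the ClassNotFoundException discussed above.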