-
help in distribution of a task with hadoop
Pierre Antoine DuBoDeNa 2012-08-13, 17:59
Hello,
We use hadoop to distribute a task over our machines.
This task requires only the mapper class to be defined. We want to do some text processing on thousands of documents, so we create key-value pairs where the key is just an increasing number and the value is the path of the file to be processed.
We have a problem including an external jar file/class when running our job jar, which we build as follows:
$ mkdir Rdg_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d Rdg_classes Rdg.java
$ jar -cvf Rdg.jar -C Rdg_classes/ .
We have tried the following options:
*1. Set HADOOP_CLASSPATH to the location of the external jar files or external classes.*
It doesn't help. Instead, the job stops recognizing the Reducer and fails with the error below:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
    at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1028)
    at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1380)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:981)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
    ... 10 more
Caused by: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
    ... 11 more
*2. Use the -libjars option as below:*
hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output
Here Rdg_lib is a folder, stored on HDFS, containing all the required classes/jars. But Hadoop starts reading -libjars as an input path and gives this error:
12/08/10 08:16:24 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
Is there any other way to do it, or are we doing something wrong?
Best,
-
Re: help in distribution of a task with hadoop
Bertrand Dechoux 2012-08-13, 18:22
1) A standard way of doing it would be to have the content of all your files inside HDFS. You could then process <key,value> pairs where the key is the name of the file and the value is its contents. That would improve performance: data locality, less network traffic... But you may have constraints...
2) Maven is a simple way of doing it.
Regards
Bertrand
-- Bertrand Dechoux
-
Re: help in distribution of a task with hadoop
Pierre Antoine DuBoDeNa 2012-08-13, 18:27
We have moved all the documents to HDFS. I understand that with our 1st option we need more I/O than with what you suggest, but let's say that's not a problem for now.
Could you please point me to something on option 2)? How could we do that? Is there any tutorial or example?
Thanks
-
Re: help in distribution of a task with hadoop
Bejoy Ks 2012-08-13, 18:29
Hi Bertrand
The -libjars option works well with the 'hadoop jar' command. Instead of executing your runnable with the plain java 'jar' command, use 'hadoop jar'. When you use 'hadoop jar' you can ship the dependent jars/files etc. in either of two ways:
1) include them in the /lib folder inside your jar
2) use -libjars / -files to distribute the jars or files
Regards
Bejoy KS
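A note on option 2): -libjars takes a single comma-separated list of jars, whereas a shell glob such as Rdg_lib/* expands to space-separated paths, so only the first path would be read as the option's value. The snippet below is a minimal sketch of one way to build that list. It assumes the jars sit in a directory on the local filesystem of the submitting machine, and the class name LibJarsArg and method commaSeparatedJars are made up for this illustration; nothing here is part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: prints the jars found in a
// local directory as one comma-separated string suitable for -libjars.
public class LibJarsArg {

    /** Returns the *.jar files directly inside dir, sorted and joined with commas. */
    public static String commaSeparatedJars(Path dir) throws IOException {
        List<String> jars = new ArrayList<>();
        // The glob keeps only entries whose file name ends in .jar.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toString());
            }
        }
        // Directory listing order is unspecified, so sort for a stable result.
        Collections.sort(jars);
        return String.join(",", jars);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory defaults to the Rdg_lib name used in this thread.
        Path dir = Paths.get(args.length > 0 ? args[0] : "Rdg_lib");
        System.out.println(commaSeparatedJars(dir));
    }
}
```

Once compiled, the output could be passed on the command line, for example as -libjars "$(java LibJarsArg Rdg_lib)", placed before the input and output arguments.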
-
Re: help in distribution of a task with hadoop
Pierre Antoine DuBoDeNa 2012-08-13, 18:32
You mean like that:
hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output
Here Rdg_lib is a folder, stored on HDFS, containing all the required classes/jars.
We get this error, though. Are we doing something wrong?
12/08/10 08:16:24 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
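The input path in this error ends in /-libjars, which suggests that the driver used its first raw argument as the input path without first stripping Hadoop's generic options. In Hadoop that stripping is normally obtained by running the driver through ToolRunner (which relies on GenericOptionsParser) and reading the paths from the arguments handed to run(). The Rdg driver code is not shown in this thread, so this is a guess at the cause. The sketch below is not Hadoop code: it is a stdlib-only illustration of the idea, with a made-up class GenericArgsSketch and a deliberately simplified option list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: mimics, in simplified form, how generic options are
// separated from the application's own positional arguments.
public class GenericArgsSketch {

    // A few generic options that take one value. This short list is an
    // assumption made for the sketch, not Hadoop's full set.
    private static final Set<String> OPTIONS_WITH_VALUE =
            new HashSet<>(Arrays.asList("-libjars", "-files", "-archives"));

    /** Returns args with the listed options and their values removed. */
    public static String[] remainingArgs(String[] args) {
        List<String> rest = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (OPTIONS_WITH_VALUE.contains(args[i]) && i + 1 < args.length) {
                i++; // skip the option's value as well
            } else {
                rest.add(args[i]);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] cli = {"-libjars", "a.jar,b.jar", "tester", "rdg_output"};

        // A driver that reads the raw array sees the option name as its input path.
        System.out.println("raw args[0]    = " + cli[0]);

        // After stripping, the positional arguments are input and output again.
        String[] rest = remainingArgs(cli);
        System.out.println("remaining args = " + Arrays.toString(rest));
    }
}
```

If this is indeed the cause, the usual fix would be for the driver to implement Tool, be launched with ToolRunner.run, and take its input and output paths from the argument array passed to run() rather than from the one passed to main().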