Re: Distributing our jars to all machines in a cluster
Praveen Sripati 2011-11-20, 02:25
Hi,
Here are the different ways of distributing 3rd party jars with the application.

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Thanks,
Praveen

On Wed, Nov 16, 2011 at 11:30 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Libjars works if your MR job is initialized correctly. Here's a code
> snippet:
>
> public static void main(String[] args) throws Exception {
>     GenericOptionsParser optParser = new GenericOptionsParser(args);
>     int exitCode = ToolRunner.run(optParser.getConfiguration(),
>         new MyMRJob(),
>         optParser.getRemainingArgs());
>     System.exit(exitCode);
> }
>
> Pig works by re-jarring your whole application, and there's an
> outstanding patch to make it run libjars -- which works, I've been
> running it in production at Twitter.
>
> -D
>
> On Wed, Nov 16, 2011 at 9:00 AM, Something Something
> <[EMAIL PROTECTED]> wrote:
> > I agree. It will eventually get us in trouble. That's why we want to get
> > the -libjars option to work, but it's not working.. arrrghhh.. It's the
> > simplest things in engineering that take the longest time... -:)
> >
> > Can you see why this may not work?
> >
> > /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> > /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> > -libjars /Users/xyz/modules/something/target/my.jar,
> > /Users/xyz/avro-tools-1.5.4.jar
> >
> > On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> You use maven jar-with-deps default assembly? That layout works too, but
> >> it will give you problems eventually when you have different classes with
> >> the same package and name.
> >> Java jar files are regular ZIP files. They can contain duplicate entries.
> >> I don't know whether your packaging creates duplicates in them, but if it
> >> does, it could be the cause of your problem.
> >> Try checking your jar for a duplicate license dir in the META-INF
> >> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort |
> >> uniq -d)
> >>
> >> Friso
> >>
> >> On 16 nov. 2011, at 17:33, Something Something wrote:
> >>
> >> Thanks Bejoy & Friso. When I use the all-in-one jar file created by Maven
> >> I get this:
> >>
> >> Mkdirs failed to create
> >> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
> >>
> >> Do you recall coming across this? Our 'all-in-one' jar is not exactly how
> >> you have described it. It doesn't contain any JARs, but it has all the
> >> classes from all the dependent JARs.
> >>
> >> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> >> <[EMAIL PROTECTED]> wrote:
> >>>
> >>> We usually package my jobs as a single jar that contains a /lib directory
> >>> in the jar that contains all other jars that the job code depends on. Hadoop
> >>> understands this layout when run as 'hadoop jar'. So the jar layout would be
> >>> something like:
> >>> /META-INF/manifest.mf
> >>> /com/mypackage/MyMapperClass.class
> >>> /com/mypackage/MyReducerClass.class
> >>> /lib/dependency1.jar
> >>> /lib/dependency2.jar
> >>> etc.
> >>> If you use Maven or some other build tool with dependency management, you
> >>> can usually produce this jar as part of your build. We also have Maven write
> >>> the main class to the manifest, such that there is no need to type it. So
> >>> for us, submitting a job looks like:
> >>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
> >>> Then Hadoop will take care of submitting and distributing, etc. Of course
> >>> you pay the penalty of always sending all of your dependencies over the wire
> >>> (the job jar gets replicated to 10 machines by default). Pre-distributing
> >>> sounds tedious and error prone to me. What if you have different jobs that
> >>> require different versions of the same dependency?
> >>>
> >>> HTH,
> >>> Friso
> >>>
> >>> On 16 nov. 2011, at 15:42, Something Something wrote:
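
Friso's suggestion in the thread to look for duplicate entries in the jar (the unzip -l ... | sort | uniq -d pipeline) can also be sketched in Java, which may be handy where those shell tools are not available. This is only an illustration under stated assumptions: JarDuplicateCheck, findDuplicates, entryNames and the --ignore-case switch are hypothetical names, not part of Hadoop. The case-insensitive mode rests on a guess the thread does not confirm, namely that on a case-insensitive file system an entry such as META-INF/LICENSE could clash with a META-INF/license/ directory while the jar is being unpacked.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of Hadoop.
class JarDuplicateCheck {

    /**
     * Returns the entry names that occur more than once, sorted.
     * A trailing "/" is dropped so that a directory entry "a/" and a
     * file entry "a" are reported as a clash. With ignoreCase set,
     * names that differ only by case are treated as equal (an
     * assumption about case-insensitive file systems).
     */
    static Set<String> findDuplicates(Iterable<String> names, boolean ignoreCase) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String name : names) {
            String key = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
            if (ignoreCase) {
                key = key.toLowerCase(Locale.ROOT);
            }
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    /**
     * Reads entry names sequentially from a ZIP/jar stream.
     * ZipInputStream walks the entries one by one, so repeated
     * names are returned as often as they occur.
     */
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: JarDuplicateCheck <jar> [--ignore-case]");
            return;
        }
        boolean ignoreCase = args.length > 1 && args[1].equals("--ignore-case");
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String duplicate : findDuplicates(entryNames(in), ignoreCase)) {
                System.out.println(duplicate);
            }
        }
    }
}
```

After compiling, running it as "java JarDuplicateCheck my.jar" prints one clashing name per line and prints nothing when no clash is found. Adding --ignore-case widens the check to names that differ only by case.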