MapReduce >> mail # user >> Re: Distributing our jars to all machines in a cluster


Re: Distributing our jars to all machines in a cluster
Hi,

Here are the different ways of distributing 3rd party jars with the
application.

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Thanks,
Praveen

On Wed, Nov 16, 2011 at 11:30 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Libjars works if your MR job is initialized correctly. Here's a code
> snippet:
>
>  public static void main(String[] args) throws Exception {
>    GenericOptionsParser optParser = new GenericOptionsParser(args);
>    int exitCode = ToolRunner.run(optParser.getConfiguration(),
>        new MyMRJob(),
>        optParser.getRemainingArgs());
>    System.exit(exitCode);
>  }
>
> Pig works by re-jarring your whole application, and there's an
> outstanding patch to make it use libjars -- which works; I've been
> running it in production at Twitter.
>
> -D
>
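The ToolRunner snippet above assumes MyMRJob implements Hadoop's Tool interface, so that ToolRunner can hand it the Configuration with -libjars and the other generic options already applied. A minimal sketch of such a class (job name and the commented-out setup calls are illustrative, not from the thread):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;

// Minimal Tool implementation sketch; mapper/reducer and I/O setup omitted.
public class MyMRJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration ToolRunner passed in, which
    // already reflects -libjars, -D, -files, etc.
    Job job = new Job(getConf(), "my-mr-job");
    job.setJarByClass(MyMRJob.class);
    // job.setMapperClass(...); job.setReducerClass(...);
    // FileInputFormat.addInputPath(job, ...); etc.
    return job.waitForCompletion(true) ? 0 : 1;
  }
}
```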
> On Wed, Nov 16, 2011 at 9:00 AM, Something Something
> <[EMAIL PROTECTED]> wrote:
> > I agree.  It will eventually get us in trouble.  That's why we want to
> > get the -libjars option to work, but it's not working.. arrrghhh..  It's
> > the simplest things in engineering that take the longest time... -:)
> >
> > Can you see why this may not work?
> >
> > /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> > /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> > -libjars /Users/xyz/modules/something/target/my.jar,
> > /Users/xyz/avro-tools-1.5.4.jar
> >
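One likely culprit in the command above is the whitespace after the comma in the -libjars value (possibly introduced by mail wrapping): the shell would then pass the second jar as a separate argument instead of part of the comma-separated list. A sketch of the corrected invocation, using the paths from the message; the echo shows the command rather than running it (drop the echo to actually submit):

```shell
# The -libjars value must be a single comma-separated token with no
# spaces; a space after the comma splits it into two shell arguments.
# Paths are taken from the original message.
LIBJARS=/Users/xyz/modules/something/target/my.jar,/Users/xyz/avro-tools-1.5.4.jar

# echo prints the command that would be run; remove it to submit the job.
echo /Users/xyz/hadoop-0.20.2/bin/hadoop jar \
  /Users/xyz/modules/something/target/my.jar \
  com.xyz.common.MyMapReduce \
  -libjars "$LIBJARS"
```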
> > On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> You use the Maven jar-with-deps default assembly? That layout works too,
> >> but it will eventually give you problems when you have different classes
> >> with the same package and name.
> >> Java jar files are regular ZIP files. They can contain duplicate entries.
> >> I don't know whether your packaging creates duplicates in them, but if it
> >> does, it could be the cause of your problem.
> >> Try checking your jar for a duplicate license dir in the META-INF
> >> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort |
> >> uniq -d)
> >>
> >> Friso
> >>
> >> On 16 nov. 2011, at 17:33, Something Something wrote:
> >>
> >> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by
> >> Maven I get this:
> >>
> >> Mkdirs failed to create
> >> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
> >>
> >>
> >> Do you recall coming across this?  Our 'all-in-one' jar is not exactly
> >> how you have described it.  It doesn't contain any JARs, but it has all
> >> the classes from all the dependent JARs.
> >>
> >>
> >> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> >> <[EMAIL PROTECTED]> wrote:
> >>>
> >>> We usually package our jobs as a single jar that contains a /lib
> >>> directory holding all the other jars that the job code depends on.
> >>> Hadoop understands this layout when run as 'hadoop jar'. So the jar
> >>> layout would be something like:
> >>> /META-INF/manifest.mf
> >>> /com/mypackage/MyMapperClass.class
> >>> /com/mypackage/MyReducerClass.class
> >>> /lib/dependency1.jar
> >>> /lib/dependency2.jar
> >>> etc.
> >>> If you use Maven or some other build tool with dependency management,
> >>> you can usually produce this jar as part of your build. We also have
> >>> Maven write the main class to the manifest, so that there is no need to
> >>> type it. For us, submitting a job looks like:
> >>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
> >>> Then Hadoop will take care of submitting and distributing, etc. Of
> >>> course you pay the penalty of always sending all of your dependencies
> >>> over the wire (the job jar gets replicated to 10 machines by default).
> >>> Pre-distributing sounds tedious and error prone to me. What if you have
> >>> different jobs that require different versions of the same dependency?
> >>>
> >>> HTH,
> >>> Friso
> >>>
> >>>
> >>>
> >>>
> >>> On 16 nov. 2011, at 15:42, Something Something wrote:
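The /lib jar layout Friso describes can typically be produced with a custom Maven Assembly Plugin descriptor. A sketch, assuming that approach; the descriptor id and file path are illustrative, not from the thread:

```xml
<!-- src/main/assembly/hadoop-job.xml: packs the project's own classes at
     the jar root and all runtime dependencies as jars under /lib, which
     'hadoop jar' understands. -->
<assembly>
  <id>hadoop-job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- the job's own classes, unpacked at the jar root -->
      <useProjectArtifact>true</useProjectArtifact>
      <unpack>true</unpack>
      <scope>runtime</scope>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
    <dependencySet>
      <!-- third-party dependencies, kept as jars under /lib -->
      <useProjectArtifact>false</useProjectArtifact>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
    </dependencySet>
  </dependencySets>
</assembly>
```

Binding this descriptor to the package phase (and setting the main class via the archive/manifest configuration) yields the single submit-ready jar described above.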