What's the best practice for managing Hadoop dependencies?
First of all, I should mention that I'm using the CDH5 beta and managing the
project with Maven, and that I've already googled and read a lot, e.g.
https://issues.apache.org/jira/browse/MAPREDUCE-1700
http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/

I believe the problem is quite common: when we write an MR job, we need
lots of dependencies, which may be missing from, or conflict with, the
HADOOP_CLASSPATH.
There are several options, e.g.
1. Add all libraries to my own JAR, and set HADOOP_USER_CLASSPATH_FIRST=true.
   This is what I do now. It makes the jar very big, and it still doesn't
work: e.g. I already packaged guava-16.0.jar into my jar, but the job still
uses the guava-11.0.2.jar from the HADOOP_CLASSPATH (how I launch it is
sketched right after the build configuration).
   Below is my build configuration:
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>xxx.xxx.xxx.Runner</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
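
For reference, this is roughly how I launch the job (the jar name is
illustrative; as far as I understand, HADOOP_USER_CLASSPATH_FIRST only
reorders the HADOOP_CLASSPATH entries when the hadoop script builds the
client classpath, which may be why the tasks still see guava-11.0.2):

    # current launch; the env var is supposed to put my classpath entries
    # ahead of Hadoop's own jars on the client side
    export HADOOP_USER_CLASSPATH_FIRST=true
    export HADOOP_CLASSPATH=target/myjob-jar-with-dependencies.jar
    hadoop jar target/myjob-jar-with-dependencies.jar xxx.xxx.xxx.Runner <args>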

2. Distinguish which libraries are not already on the HADOOP_CLASSPATH, and
put only those into the DistributedCache.
   I think that's hard to distinguish, and if a library still conflicts,
which dependency takes precedence? (A launch sketch for this option follows
the list.)
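
For option 2, what I had in mind is the generic -libjars flag, which (as far
as I know) ships the listed jars to the tasks through the distributed cache;
it requires the driver to parse its arguments via ToolRunner /
GenericOptionsParser. Jar and class names are again illustrative:

    # hypothetical option 2 launch: keep the job jar thin and ship the extra
    # libraries to the tasks via the distributed cache using -libjars
    hadoop jar target/myjob.jar xxx.xxx.xxx.Runner -libjars guava-16.0.jar <args>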
*What's the best practice, especially with Maven?*
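
P.S. One more idea I keep running into for exactly this kind of Guava
conflict, though I don't know whether it counts as best practice: relocating
the conflicting packages with the maven-shade-plugin instead of building an
assembly. A rough sketch (the shaded package prefix is just an example):

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <relocations>
                                <!-- rewrite my bundled Guava to a private
                                     package so it cannot clash with the
                                     cluster's guava-11.0.2 -->
                                <relocation>
                                    <pattern>com.google.common</pattern>
                                    <shadedPattern>shaded.com.google.common</shadedPattern>
                                </relocation>
                            </relocations>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

Would that be the recommended approach?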

 