[proposal] Changing classpath creation for hadoop to just use `hadoop classpath`'s output
tl;dr
Ideally the generation of the hadoop+accumulo classpath should only be done
in one place.  At least for all versions of hadoop I've seen in the past 5
years, there is one place to get hadoop's classpath (the `hadoop classpath`
command).  Why not use it?
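For example, here's a minimal sketch (assuming the `hadoop` binary is on
the PATH) of what a launcher script would do:

  # Ask hadoop itself for its classpath instead of reconstructing it.
  HADOOP_CP="$(hadoop classpath)"
  echo "$HADOOP_CP"   # colon-separated list of dirs and jar globs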

----

For hadoop, I've found that the hadoop2 packages from bigtop deploy to
different locations than the stock hadoop2 tarballs, requiring yet another
round of hadoop path wrangling between hadoop1, hadoop2, and the different
deployment mechanisms.  It's kind of a mess (see the sketch after the
listings below):

hadoop1:
$HADOOP_HOME/*.jar
$HADOOP_HOME/lib/*.jar

hadoop2 tarball:
$HADOOP_HOME/share/hadoop/common/*.jar
$HADOOP_HOME/share/hadoop/common/lib/*.jar
$HADOOP_HOME/share/hadoop/hdfs/*.jar
$HADOOP_HOME/share/hadoop/hdfs/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/lib/*.jar

hadoop2 bigtop rpm:
/usr/lib/hadoop-mapreduce ...
/usr/lib/hadoop-hdfs ...
/usr/lib/hadoop-yarn ...
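
To show the mess concretely, here's a rough sketch of the kind of layout
guessing a launcher script ends up doing without `hadoop classpath` (the
directory tests below are illustrative assumptions, not exhaustive):

  if [ -d "$HADOOP_HOME/share/hadoop/common" ]; then
      # stock hadoop2 tarball
      CP="$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*"
      CP="$CP:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*"
      CP="$CP:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*"
  elif [ -d /usr/lib/hadoop-hdfs ]; then
      # bigtop rpm layout
      CP="/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-mapreduce/*"
  else
      # hadoop1 layout
      CP="$HADOOP_HOME/*:$HADOOP_HOME/lib/*"
  fi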

There is already a script in place that generates hadoop classpaths
consistently across multiple versions and deployments -- the
`hadoop classpath` command. Why not just use the classpath generated by
running `hadoop classpath` instead of trying to build the classpath in
java/python/xml code and having to modify it for each kind of hadoop?
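
Concretely, a launcher could look something like this minimal sketch (the
main class and lib layout are assumptions for illustration, not a patch):

  # Delegate hadoop classpath construction to hadoop itself, then
  # prepend accumulo's own jars.
  HADOOP_CP="$(hadoop classpath)" || { echo "hadoop not on PATH" >&2; exit 1; }
  CLASSPATH="$ACCUMULO_HOME/lib/*:$HADOOP_CP"
  exec java -cp "$CLASSPATH" org.apache.accumulo.start.Main "$@"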

Does this seem reasonable?  Where would there be trouble spots?

Semi-related, I've found that if $HADOOP_HOME/lib doesn't contain certain
jars that accumulo depends on, the Platform or Main wrapper programs can
fail.
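
A cheap guard against that failure mode might be a pre-flight check along
these lines (the jar names below are placeholders, not the actual
dependency list):

  # Hypothetical check: warn early if an expected dependency jar is
  # missing from $HADOOP_HOME/lib.
  for jar in some-required-lib another-required-lib; do
      if ! ls "$HADOOP_HOME"/lib/"$jar"*.jar >/dev/null 2>&1; then
          echo "warning: $jar jar not found under \$HADOOP_HOME/lib" >&2
      fi
  done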

A follow-on would be to consolidate the 3 accumulo+hadoop classpath
generation locations into one, or at least to factor out the common
hadoop+accumulo parts.

The accumulo+hadoop classpath generation happens in 3 places:
1) hard-coded with hadoop1 locations in Java in the AccumuloClassLoader,
2) as comments in the accumulo-site.xml.example file,
3) in the soon-to-be-deprecated TestUtils.py (in the system/auto test
suite).

I don't quite understand all of the classloader magic yet, but I'd wager
that it is used for extensions like iterators.  Could we just have one
initial point of accumulo+hadoop classpath generation, and then use the xml
config to add more jars, leaving the nested classloaders to handle the
magic for dfs jar loading and for iterator extensions?

Thanks,
Jon.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]