Pig >> mail # user >> Best practices for handling dependencies in Java UDFs?

Best practices for handling dependencies in Java UDFs?
I'm building a system for processing large RDF data sets with Hadoop.

The first stages are written in Java and normalize, validate, and clean up the data.

The stage that comes after this is going to subdivide Freebase into several
major "horizontal" subdivisions that users may or may not want. For
instance, Freebase uses two different vocabularies for expressing external
keys -- each represents 100+ million facts, so it's desirable to pick the
one you like and throw the other in the bit bucket.

That phase will probably be written in Java, but to do the research to
figure out how to partition it, I want to do ad-hoc queries with Pig.

The first thing I'm working on is an input UDF for reading N-Triples files;
rather than deeply parsing the nodes, I'm splitting each triple into
three Texts. This process isn't too different from reading a
whitespace-separated file, but it's a little more complicated because
sometimes there are spaces in the object field. You also need to trim off
the terminating period and maybe some whitespace at the end.
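Roughly, the splitting logic looks something like this (class and method
names here are just illustrative, not the actual UDF):

```java
// Sketch: split one N-Triples line into subject, predicate, object.
// Assumes a well-formed triple terminated by " ." -- the subject and
// predicate never contain spaces, but the object may.
public class NTripleSplitter {
    public static String[] splitTriple(String line) {
        String trimmed = line.trim();
        // Drop the terminating period and any whitespace before it.
        if (trimmed.endsWith(".")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        // Peel off the first two whitespace-delimited tokens; whatever
        // remains is the object, spaces and all.
        int firstSpace = trimmed.indexOf(' ');
        int secondSpace = trimmed.indexOf(' ', firstSpace + 1);
        return new String[] {
            trimmed.substring(0, firstSpace),
            trimmed.substring(firstSpace + 1, secondSpace),
            trimmed.substring(secondSpace + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = splitTriple(
            "<http://example.org/s> <http://example.org/p> \"a literal with spaces\" .");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
        System.out.println(parts[2]);
    }
}
```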

Now, it turns out that my UDF depends on classes I wrote that are spread
across three different Maven projects (the PrimitiveTriple parser has
been around for a while), so I need to REGISTER multiple jar files. I also
make heavy use of Guava and other third-party libraries, so the list of
things I need to REGISTER is pretty big.
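So the top of my Pig script ends up looking something like this (the jar
names and loader class are placeholders, not the real artifacts):

```pig
REGISTER primitive-triple-parser.jar;
REGISTER rdf-tools.jar;
REGISTER rdf-cleanup.jar;
REGISTER guava-14.0.jar;
-- ...and so on for every transitive dependency

triples = LOAD 'freebase.nt'
          USING com.example.NTripleLoader()
          AS (subject:chararray, predicate:chararray, object:chararray);
```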

What I'm trying now is to run this program, piping it like so:
mvn dependency:build-classpath | mvn exec:java

This could be integrated into the Maven build process in the future.

Anyway, is there a better way to do this?