Paul Houle 2013-08-08, 23:32
Re: Best practices for handling dependencies in Java UDFs?
Ryan Compton 2013-08-08, 23:50
I often do this, and then just register one giant .jar:
<!-- Plugin to create a single jar that includes all dependencies -->
On Thu, Aug 8, 2013 at 4:32 PM, Paul Houle <[EMAIL PROTECTED]> wrote:
> I'm building a system for processing large RDF data sets with Hadoop.
> The first stages are written in Java and perform the function of
> normalizing, validating and cleaning up the data.
> The stage that comes after this is going to subdivide Freebase into several
> major "horizontal" subdivisions that users may or may not want. For
> instance, Freebase uses two different vocabularies for expressing external
> keys -- they both represent 100+ million facts, so it's desirable to
> pick one you like and throw the other in the bit bucket.
> That phase will probably be written in Java, but to do the research to
> figure out how to partition it, I want to do ad-hoc queries with Pig.
> The first thing I'm working on is an input UDF for reading N-Triples files;
> rather than deeply parsing the Nodes, I'm splitting the triples up into
> three Texts. This process isn't too different from reading a white-space
> separated file, but it's a little more complicated because sometimes there
> are spaces in the object field. You also need to trim off a period and
> maybe some whitespace at the end.
> Now, it turns out that my UDF depends on classes I wrote distributed
> throughout three different Maven projects (the PrimitiveTriple parser has
> been around for a while) so I need to REGISTER multiple Jar files. I also
> heavily use Guava and other third-party libraries so the list of things I
> need to REGISTER is pretty big.
> What I'm trying now is to run this program
> piping it like so
> mvn dependency:build-classpath | mvn exec:java
> This could be integrated into the maven build process in the future.
> Anyway, is there a better way to do this?
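
For reference, the N-Triples splitting described in the quoted message (three fields, spaces allowed in the object, a trailing period and whitespace to trim) could be sketched roughly as below. This is only an illustration, not the PrimitiveTriple parser mentioned above: the class name NTripleSplitter and its split method are hypothetical, it uses plain String instead of Hadoop's Text so it runs without Hadoop on the classpath, and it assumes the subject and predicate never contain whitespace, so that everything after the second whitespace run is the object.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not the PrimitiveTriple parser from the thread. */
final class NTripleSplitter {
    private NTripleSplitter() {}

    /**
     * Splits one N-Triples line into {subject, predicate, object}.
     * Returns null for blank lines, comment lines, and lines that do not
     * end in a period or do not contain three fields.
     */
    static String[] split(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#") || !s.endsWith(".")) {
            return null;
        }
        // Drop the terminating period and any whitespace before it.
        s = s.substring(0, s.length() - 1).trim();
        // A limit of 3 keeps spaces inside the object field intact:
        // only the first two whitespace runs act as separators.
        String[] parts = s.split("\\s+", 3);
        return parts.length == 3 ? parts : null;
    }

    public static void main(String[] args) {
        String line = "<http://example.org/s> <http://example.org/p> \"hello world\"@en .  ";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

In a real Pig loader the three strings would presumably be wrapped in Text (or put into a Tuple) rather than returned as an array; that part is left out here because it depends on the loader's design.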