Paul Houle 2013-08-08, 23:32
Re: Best practices for handling dependencies in Java UDFs?
I often do this, and then just register one giant .jar

   <!-- Plugin to create a single jar that includes all dependencies -->


On Thu, Aug 8, 2013 at 4:32 PM, Paul Houle <[EMAIL PROTECTED]> wrote:
> I'm building a system for processing large RDF data sets with Hadoop.
> https://github.com/paulhoule/infovore/wiki
> The first stages are written in Java and normalize, validate, and clean
> up the data.
> The stage that comes after this is going to subdivide Freebase into several
> major "horizontal" subdivisions that users may or may not want.  For
> instance,  Freebase uses two different vocabularies for expressing external
> keys -- they both represent 100+ million facts so it's desirable to
> pick one you like and throw the other in the bit bucket.
> That phase will probably be written in Java,  but to do the research to
> figure out how to partition it,  I want to do ad-hoc queries with Pig.
> The first thing I'm working on is an input UDF for reading N-Triples files;
> rather than deeply parsing the Nodes, I'm splitting the triples up into
> three Texts.  This process isn't too different from reading a white-space
> separated file,  but it's a little more complicated because sometimes there
> are spaces in the object field.  You also need to trim off a period and
> maybe some whitespace at the end.
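
For readers following along, the splitting step described above might look roughly like the sketch below. This is not the code from infovore; the class name NTripleSplitter and the method split are made up for illustration, and a real loader would wrap the three strings in Hadoop Text objects. It relies on the fact that the subject and predicate never contain whitespace, so only the first two whitespace runs are treated as field boundaries.

```java
import java.util.Arrays;

// Hypothetical helper, not the infovore implementation.
class NTripleSplitter {
    /**
     * Splits one N-Triples line into subject, predicate and object strings.
     * Returns null for blank lines, comment lines, and lines that do not
     * end with the terminating period or do not have three fields.
     */
    static String[] split(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#")) {
            return null;
        }
        if (!s.endsWith(".")) {
            return null;
        }
        // Drop the terminating period and any whitespace before it.
        s = s.substring(0, s.length() - 1).trim();
        // Limit of 3: everything after the second whitespace run stays in
        // the object field, including any spaces inside a literal.
        String[] parts = s.split("\\s+", 3);
        return parts.length == 3 ? parts : null;
    }

    public static void main(String[] args) {
        String line =
            "<http://example.org/s> <http://example.org/p> \"hello world\"@en .";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

The limit argument to String.split is what keeps a literal such as "hello world" in one piece.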
> Now, it turns out that my UDF depends on classes I wrote that are spread
> across three different Maven projects (the PrimitiveTriple parser has
> been around for a while) so I need to REGISTER multiple Jar files.  I also
> heavily use Guava and other third-party libraries so the list of things I
> need to REGISTER is pretty big.
> What I'm trying now is to run this program
> https://github.com/paulhoule/infovore/blob/master/chopper/src/main/java/com/ontology2/chopper/tools/GenerateRegisterStatements.java
> piping it like so
> mvn dependency:build-classpath | mvn exec:java
> -Dexec.mainClass=com.ontology2.chopper.tools.GenerateRegisterStatements
> This could be integrated into the maven build process in the future.
> Anyway,  is there a better way to do this?
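
For anyone who wants to try the generate-the-REGISTER-lines approach from the quoted message, here is a minimal sketch of what such a tool could do. It is not the linked GenerateRegisterStatements class; the name RegisterStatements and its method are hypothetical. It assumes the piped Maven output consists only of bracketed log lines such as [INFO] plus one line holding the classpath, with entries separated by the platform path separator.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical sketch, not the linked GenerateRegisterStatements class.
class RegisterStatements {
    /**
     * Turns the text printed by "mvn dependency:build-classpath" into Pig
     * REGISTER statements, one per jar on the classpath.
     * Assumption: every line that is not the classpath starts with "[".
     */
    static List<String> fromMavenOutput(String mavenOutput, String separator) {
        List<String> statements = new ArrayList<>();
        for (String raw : mavenOutput.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("[")) {
                continue; // skip blank lines and Maven log lines
            }
            for (String entry : line.split(Pattern.quote(separator))) {
                // Directories such as target/classes are not jars; skip them.
                if (entry.endsWith(".jar")) {
                    statements.add("REGISTER " + entry + ";");
                }
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // Read all of standard input, then print one statement per line.
        Scanner in = new Scanner(System.in).useDelimiter("\\A");
        String all = in.hasNext() ? in.next() : "";
        for (String s : fromMavenOutput(all, File.pathSeparator)) {
            System.out.println(s);
        }
    }
}
```

The printed lines can be pasted at the top of a Pig script. Compared with one giant jar, this keeps each dependency separate, at the cost of a long preamble that has to be regenerated whenever the dependencies change.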