Paul Houle 2013-08-08, 23:32
Re: Best practices for handling dependencies in Java UDFs?
I often do this, and then just register one giant .jar:

<!-- Plugin to create a single jar that includes all dependencies -->
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.4</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
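
With that in place, a Pig script needs only one REGISTER line, no matter how many upstream projects and third-party libraries the UDFs pull in. A minimal sketch, assuming the plugin's default naming (the path below is a placeholder for whatever your build actually produces):

-- path is illustrative; your artifactId and version will differ
REGISTER target/chopper-1.0-SNAPSHOT-jar-with-dependencies.jar;

By default the assembly plugin names the artifact <artifactId>-<version>-jar-with-dependencies.jar, so the REGISTER line stays stable across builds as long as the version doesn't change.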

On Thu, Aug 8, 2013 at 4:32 PM, Paul Houle <[EMAIL PROTECTED]> wrote:
> I'm building a system for processing large RDF data sets with Hadoop.
>
> https://github.com/paulhoule/infovore/wiki
>
> The first stages are written in Java and normalize, validate, and clean
> up the data.
>
> The stage after that is going to split Freebase into several major
> "horizontal" subdivisions that users may or may not want. For instance,
> Freebase uses two different vocabularies for expressing external keys;
> each represents 100+ million facts, so it's desirable to pick the one
> you like and throw the other in the bit bucket.
>
> That phase will probably be written in Java, but to work out how to
> partition the data, I want to run ad-hoc queries with Pig.
>
> The first thing I'm working on is an input UDF for reading N-Triples
> files; rather than deeply parsing the nodes, I'm splitting each triple
> into three Texts. This isn't too different from reading a
> whitespace-separated file, but it's a little more complicated because
> the object field can itself contain spaces, and you also need to trim
> the trailing period and any whitespace at the end of each line.
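>
> The kind of ad-hoc query I have in mind for the partitioning research
> looks something like the following; the loader class, file name, and
> predicate pattern are illustrative, not the real names:
>
> -- NTriplesLoader is a stand-in name for the loader described above
> triples = LOAD 'freebase.nt'
>     USING com.ontology2.chopper.pig.NTriplesLoader()
>     AS (subject:chararray, predicate:chararray, object:chararray);
>
> -- count facts per external-key predicate to size the two vocabularies
> keys = FILTER triples BY predicate MATCHES '.*key.*';
> key_counts = FOREACH (GROUP keys BY predicate)
>              GENERATE group AS predicate, COUNT(keys) AS n;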
>
> Now, it turns out that my UDF depends on classes I wrote that are
> spread across three different Maven projects (the PrimitiveTriple
> parser has been around for a while), so I need to REGISTER multiple jar
> files. I also use Guava and other third-party libraries heavily, so the
> list of things I need to REGISTER is pretty big.
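>
> Concretely, the preamble of every script ends up looking something like
> this (the paths and names here are made up, just to give the flavor):
>
> REGISTER /home/paul/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.jar;
> REGISTER primitive-triple/target/primitive-triple-1.0-SNAPSHOT.jar;
> REGISTER chopper/target/chopper-1.0-SNAPSHOT.jar;
> -- ...and a dozen more lines like these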
>
> What I'm trying now is to run this program
>
> https://github.com/paulhoule/infovore/blob/master/chopper/src/main/java/com/ontology2/chopper/tools/GenerateRegisterStatements.java
>
> piping it like so:
>
> mvn dependency:build-classpath | mvn exec:java \
>     -Dexec.mainClass=com.ontology2.chopper.tools.GenerateRegisterStatements
>
> This could be integrated into the Maven build process in the future.
>
> Anyway, is there a better way to do this?