Re: Best practices for handling dependencies in Java UDFs?
I often do this, and then just register one giant .jar:

   <!-- Plugin to create a single jar that includes all dependencies -->
   <plugin>
     <artifactId>maven-assembly-plugin</artifactId>
     <version>2.4</version>
     <configuration>
       <descriptorRefs>
         <descriptorRef>jar-with-dependencies</descriptorRef>
       </descriptorRefs>
     </configuration>
     <executions>
       <execution>
         <id>make-assembly</id>
         <phase>package</phase>
         <goals>
           <goal>single</goal>
         </goals>
       </execution>
     </executions>
   </plugin>
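With that in place, mvn package also builds a
target/<artifact>-<version>-jar-with-dependencies.jar next to the normal
jar, and the Pig script needs only a single statement (the jar name below
is hypothetical):

   REGISTER target/chopper-1.0-SNAPSHOT-jar-with-dependencies.jar;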

On Thu, Aug 8, 2013 at 4:32 PM, Paul Houle <[EMAIL PROTECTED]> wrote:
> I'm building a system for processing large RDF data sets with Hadoop.
>
> https://github.com/paulhoule/infovore/wiki
>
> The first stages are written in Java; they normalize, validate, and
> clean up the data.
>
> The stage after that will subdivide Freebase into several major
> "horizontal" subsets that users may or may not want. For instance,
> Freebase uses two different vocabularies for expressing external keys --
> each represents 100+ million facts, so it's desirable to pick the one
> you like and throw the other in the bit bucket.
>
> That phase will probably be written in Java, but to figure out how to
> partition the data, I want to run ad-hoc queries with Pig.
>
> The first thing I'm working on is an input UDF for reading N-Triples
> files; rather than deeply parsing the Nodes, I'm splitting each triple
> into three Texts. This isn't too different from reading a
> whitespace-separated file, but it's a little more complicated because
> sometimes there are spaces in the object field. You also need to trim
> off a period and maybe some whitespace at the end.
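
For illustration, a minimal sketch of that splitting logic -- not the
actual UDF, and assuming the subject and predicate never contain
whitespace:

   public class TripleSplitSketch {
       // Split one N-Triples line into subject, predicate, and object.
       // Subject and predicate contain no spaces, so split on the first
       // two runs of whitespace and treat the remainder as the object.
       static String[] splitTriple(String line) {
           String[] parts = line.trim().split("\\s+", 3);
           String object = parts[2].trim();
           // Trim the terminating period and any whitespace before it.
           if (object.endsWith(".")) {
               object = object.substring(0, object.length() - 1).trim();
           }
           return new String[] { parts[0], parts[1], object };
       }

       public static void main(String[] args) {
           String[] t = splitTriple("<s> <p> \"two words\"@en .");
           System.out.println(t[0] + " | " + t[1] + " | " + t[2]);
       }
   }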
>
> Now, it turns out that my UDF depends on classes I wrote that are spread
> across three different Maven projects (the PrimitiveTriple parser has
> been around for a while), so I need to REGISTER multiple jar files. I
> also make heavy use of Guava and other third-party libraries, so the
> list of things I need to REGISTER is pretty big.
>
> What I'm trying now is to run this program
>
> https://github.com/paulhoule/infovore/blob/master/chopper/src/main/java/com/ontology2/chopper/tools/GenerateRegisterStatements.java
>
> piping it like so
>
> mvn dependency:build-classpath | mvn exec:java
> -Dexec.mainClass=com.ontology2.chopper.tools.GenerateRegisterStatements
>
> This could be integrated into the maven build process in the future.
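
The linked program isn't reproduced here, but as a rough sketch of the
idea (hypothetical code, not the actual GenerateRegisterStatements): read
the classpath from stdin and emit one REGISTER statement per jar.

   import java.io.BufferedReader;
   import java.io.File;
   import java.io.InputStreamReader;

   public class RegisterStatementSketch {
       public static void main(String[] args) throws Exception {
           BufferedReader in =
                   new BufferedReader(new InputStreamReader(System.in));
           String line;
           while ((line = in.readLine()) != null) {
               // mvn wraps the classpath in [INFO] log lines; skip the
               // log noise and split the bare classpath line on the
               // platform path separator.
               if (line.startsWith("[")) continue;
               for (String entry : line.split(File.pathSeparator)) {
                   if (entry.endsWith(".jar")) {
                       System.out.println("REGISTER " + entry + ";");
                   }
               }
           }
       }
   }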
>
> Anyway, is there a better way to do this?