I'm building a system for processing large RDF data sets with Hadoop.
The first stages, written in Java, normalize, validate, and clean up the
data.
The next stage will split Freebase into several major "horizontal" subsets
that users may or may not want. For instance, Freebase uses two different
vocabularies for expressing external keys; each represents more than 100
million facts, so it's desirable to pick the one you like and throw the
other in the bit bucket.
That phase will probably be written in Java, but to do the research to
figure out how to partition it, I want to do ad-hoc queries with Pig.
The first thing I'm working on is an input UDF for reading N-Triples files;
rather than deeply parsing the Nodes, I'm splitting each triple into three
Texts. The process isn't far from reading a whitespace-separated file, but
it's a little more complicated because the object field can contain spaces,
and you also need to trim the terminating period and any trailing
whitespace.
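The splitting step described above can be sketched like this (a minimal
sketch of the per-line logic only, not the full Pig load UDF; the class
name and example URIs are mine, not from my actual code):

```java
public class NTripleSplitter {
    /** Splits one N-Triples line into { subject, predicate, object }. */
    public static String[] split(String line) {
        String s = line.trim();
        // Trim the terminating period and any whitespace around it.
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        // Split on the first two runs of whitespace only; the limit of 3
        // keeps any spaces inside the object field intact.
        return s.split("\\s+", 3);
    }

    public static void main(String[] args) {
        String line =
            "<http://example.org/a> <http://example.org/name> \"Ann Smith\" .";
        String[] parts = split(line);
        System.out.println(parts[0]); // <http://example.org/a>
        System.out.println(parts[2]); // "Ann Smith"
    }
}
```

The `split` with a limit of 3 is what makes spaces in literal objects safe:
everything after the second whitespace run lands in the object field
verbatim.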
Now, it turns out that my UDF depends on classes I wrote that are spread
across three different Maven projects (the PrimitiveTriple parser has been
around for a while), so I need to REGISTER multiple jar files. I also make
heavy use of Guava and other third-party libraries, so the list of things I
need to REGISTER is pretty long.
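One way to avoid maintaining that REGISTER list by hand would be to
generate it from the classpath Maven already knows about. This is only a
sketch of that idea; the class name, the example jar path, and the
assumption that the classpath string comes from something like
`mvn dependency:build-classpath -Dmdep.outputFile=cp.txt` are all mine:

```java
import java.io.File;

public class RegisterGenerator {
    /**
     * Turns a platform-separated classpath string into Pig REGISTER
     * statements, keeping only the jar entries.
     */
    public static String toRegisters(String classpath) {
        StringBuilder sb = new StringBuilder();
        for (String entry : classpath.trim().split(File.pathSeparator)) {
            if (entry.endsWith(".jar")) {
                sb.append("REGISTER '").append(entry).append("';\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hardcoded example path; in practice the string would be read
        // from the file the Maven dependency plugin wrote out.
        System.out.print(toRegisters(
            "/home/me/.m2/repository/com/google/guava/guava/14.0/guava-14.0.jar"));
    }
}
```

The output could be prepended to the Pig script (or passed via `pig -param`
style templating) so the jar list stays in sync with the POMs.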
What I'm trying now is to run the program by piping the two Maven goals
together, like so:

mvn dependency:build-classpath | mvn exec:java

This could be integrated into the Maven build process later on.
Anyway, is there a better way to do this?