I mentioned a few months ago that I was interested in creating a new
Scripting Engine for Pig based on the R language. I have finally gotten
that project to a point where I feel comfortable sharing it with the Pig
community.
This project can be found at: http://www.github.com/cd-wood/pigaddons
RScriptEngine is a scripting engine for Apache Pig that interprets the R
language <http://www.r-project.org/>. The goal behind this scripting engine
is compatibility and ease of use of the R language in Amazon EMR jobs.
Included in /scripts is the rpig-bootstrap.sh script, which is meant as a
bootstrap script for Amazon EMR instances; it can also be used on personal
instances to set up an environment compatible with the scripting engine.
This interpreter makes use of JRI <http://www.rforge.net/JRI/> to allow an
instance of R to run inside of the Java process.
By combining R with Pig, I feel that a large number of new analyses are
possible that cannot be done natively in Pig; while there are already
other languages for creating UDFs, the more options the better.
A cool feature that is possible by including R in a big-data analysis
package is the ease of generating images / plotting data provided by R.
While not currently implemented, one upcoming feature is the integration of
JavaGD, which will allow all images generated by the R script to be rendered
into a Java class, from which it might be possible to save, email, or
otherwise process those images.
To showcase using R with Pig, I've included a (contrived) Naive Bayes
example: a simplistic classifier that flags emails as spam based on the
presence of certain words.
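
For anyone unfamiliar with the technique, the idea behind that kind of
classifier can be sketched roughly as follows. This is not the code from the
repository (that example is written in R); it is a minimal illustration in
Python, and the function names, the toy messages, and the word list are all
made up for the purpose. It scores each class by its prior times the
probability of each vocabulary word being present or absent, with add-one
smoothing.

```python
import math


def train(messages, labels, vocabulary):
    """Estimate class priors and per-word presence probabilities.

    messages and labels are parallel lists. Add-one (Laplace) smoothing
    keeps every probability strictly between 0 and 1.
    """
    model = {}
    for cls in set(labels):
        # Each document is reduced to the set of words it contains.
        docs = [set(m.lower().split())
                for m, l in zip(messages, labels) if l == cls]
        prior = len(docs) / len(messages)
        word_prob = {
            w: (sum(w in d for d in docs) + 1) / (len(docs) + 2)
            for w in vocabulary
        }
        model[cls] = (prior, word_prob)
    return model


def classify(model, message):
    """Return the class with the highest log-posterior score."""
    words = set(message.lower().split())
    best_cls, best_score = None, -math.inf
    for cls, (prior, word_prob) in model.items():
        score = math.log(prior)
        for w, p in word_prob.items():
            # Word present contributes p, word absent contributes 1 - p.
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Hypothetical toy data, only for illustration.
messages = [
    "win free money now",
    "free prize claim now",
    "meeting notes attached",
    "lunch at noon tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]
vocabulary = ["free", "money", "prize", "meeting", "lunch"]

model = train(messages, labels, vocabulary)
print(classify(model, "claim your free prize"))
print(classify(model, "meeting at noon"))
```

Working in log space avoids underflow when the word list gets long, and the
smoothing means a word never seen in one class does not zero out that class
entirely.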
I have tested this scripting engine on Pig 0.9.2 so that it should work in
Amazon EMR; however, I haven't had a chance to test it in EMR itself yet. If
someone does, please let me know how it goes, and if anyone has more cool
examples of using R, I'd be happy to include them.
And of course, please let me know of any bugs you find or any other
suggestions you may have.