Pig >> mail # dev >> Custom Scripting Engine


Connor Woodson 2013-01-19, 02:42
Daniel Dai 2013-01-21, 21:59
Connor Woodson 2013-01-22, 00:47
Jonathan Coveney 2013-01-22, 01:04
Connor Woodson 2013-01-22, 02:22
Jonathan Coveney 2013-01-22, 23:56
Connor Woodson 2013-01-23, 00:15
Jonathan Coveney 2013-01-23, 00:44

Re: Custom Scripting Engine
There are two ways to use R from Java (that I've found). Both are a bit of
a hassle, depending on your setup.

JRI is a JNI bridge to R, so you don't need R installed on the machine for
it to work. But you do need to include a set of DLLs on the classpath; the
best way I've found to do this is to bundle the DLLs in the .jar and then
copy them to the local directory at runtime (copying them elsewhere and
changing java.library.path won't work). JRI is missing some features,
though, most notably support for multiple environments/sessions. I don't
have a plan for the R/Pig integration worked out yet, but sessions might be
useful.
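
To make the "bundle the DLLs in the .jar and copy them out at runtime" idea
concrete, here is a minimal sketch in Java. It is not code from JRI or Pig:
the class name BundledNativeLibs and both method names are made up for
illustration, and it assumes the library sits at the root of the jar. It
copies the resource into the working directory and then loads it by absolute
path with System.load, so java.library.path is never consulted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of JRI or Pig.
final class BundledNativeLibs {

    private BundledNativeLibs() {}

    /** Copies the stream to targetDir/fileName, replacing any existing file. */
    static Path extract(InputStream in, String fileName, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName).toAbsolutePath().normalize();
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    /**
     * Extracts a library bundled at the root of the jar into the current
     * working directory and loads it into the JVM.
     */
    static void loadFromJar(String fileName) throws IOException {
        try (InputStream in = BundledNativeLibs.class.getResourceAsStream("/" + fileName)) {
            if (in == null) {
                throw new IOException("Not found on classpath: " + fileName);
            }
            Path lib = extract(in, fileName, Paths.get("."));
            // System.load takes an absolute file path, unlike System.loadLibrary,
            // which searches java.library.path.
            System.load(lib.toString());
        }
    }
}
```

A caller would invoke something like BundledNativeLibs.loadFromJar("jri.dll")
once per JVM before touching any JRI classes; the file name here is only an
example and depends on the platform and on how the jar was built.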

The other method is through Rserve, which is both a Java package and an
application; the application sets up an R server that by default allows
only a single connection from the local machine (if you wanted, each
map-reduce job could connect to the same R server/instance, but I don't
think that's useful). To start this up, you would need R installed and then
run Rserve. In EMR, this would be possible as it does have R, so you would
just need a bootstrap script to start Rserve. Alternatively, it is probably
possible to start Rserve from within Java, but that's much trickier.

I prefer the first method as it eliminates the requirement of having R
installed; however, I'm hoping to implement both (for Rserve, I'll require
that the server is already started; and maybe include an option for
connecting to a specific server).

I don't have a clear vision of how R and Pig will interact; it will have to
be something different from Python or JavaScript, but I don't know how
different. I want to just scratch out something basic and then try to
evolve it from there.

I'll go ahead and submit that Jira.
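
For the ticket, the "#@ <FQCN>" first-line annotation suggested in the quoted
message below could be handled roughly as follows. This is only a sketch of
the idea, not how Pig selects engines today: EngineAnnotation and its methods
are hypothetical names, and the base type is passed in as a parameter rather
than naming any Pig class.

```java
// Hypothetical sketch of the proposed "#@ <FQCN>" annotation; not Pig code.
final class EngineAnnotation {

    private static final String PREFIX = "#@";

    private EngineAnnotation() {}

    /** Returns the class name given on a leading "#@" line, or null if there is none. */
    static String engineClassName(String script) {
        int eol = script.indexOf('\n');
        // trim() also drops a trailing '\r' from Windows line endings
        String first = (eol < 0 ? script : script.substring(0, eol)).trim();
        if (!first.startsWith(PREFIX)) {
            return null;
        }
        String name = first.substring(PREFIX.length()).trim();
        return name.isEmpty() ? null : name;
    }

    /**
     * Loads the named class reflectively and instantiates it through its
     * no-argument constructor. Throws ClassCastException if the class is
     * not a subtype of baseType.
     */
    static <T> T instantiate(String className, Class<T> baseType)
            throws ReflectiveOperationException {
        Class<? extends T> cls = Class.forName(className).asSubclass(baseType);
        return cls.getDeclaredConstructor().newInstance();
    }
}
```

The point of going through Class.forName is that the engine only has to be on
the classpath (for example in a registered jar), so no change to Pig's
internal list would be needed; scripts without the annotation would fall
through to the existing suffix and #! checks.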

Thanks,

- Connor
On Tue, Jan 22, 2013 at 4:44 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Ahhh, I see. That makes sense. Sadly, this isn't possible in the current
> version of Pig, but this is a really good reason to want to do it. Can you
> make a ticket about making it possible to plug in ScriptingEngines without
> having to make a code change to Pig? I think this would be useful for this
> reason.
>
> That said, if you dig down into how these implementations work, they are
> based on EvalFuncs, so manually making UDFs to do it is an annoyance, but
> functionally quite similar.
>
> Question about R: is there a JVM implementation, or are you shelling out?
>
>
> 2013/1/22 Connor Woodson <[EMAIL PROTECTED]>
>
> > I'm starting work on an R scripting engine; I'm not entirely sure how it
> > will be used, but I know that there have been attempts to get R working
> > with MapReduce / EMR and I thought it would be cool to do that through
> > Pig. (One fun use case might be to generate plots/graphs during the MR
> > job (then do something with them))
> >
> > The easy answer for how to get this working with Pig is to just stick new
> > scripting engines with the existing ones and update the ScriptingEngine
> > enum to include those; however, I would like to use this in EMR which
> > doesn't update its software regularly and so I was hoping there was some
> > hook to get this scripting engine called, but it looks like it'll just
> > have to be used for UDFs for now.
> >
> > If a change is going to be made, I think what would be helpful is a
> > change in how the ScriptingEngine decides which subclass to call; right
> > now (from what I can tell) it will only look at the file suffix or the #!
> > first line of the script and try and match those with its internal list.
> > Maybe allow an annotation like
> > #@ <FQCN of a ScriptingEngine>
> > as the first line of a script to force Pig to use a specific engine.
> >
> > - Connor
> >
> >
> > On Tue, Jan 22, 2013 at 3:56 PM, Jonathan Coveney <[EMAIL PROTECTED]>
> > wrote:
> >
> > > So, something like this is not currently possible, but I think it would
> > > be possible to expose a set of interfaces that would make this possible.