On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <[EMAIL PROTECTED]> wrote:
> Hey All,
> I'm working on SPARK-800. The goal is to document a best practice or
> recommended way of bundling and running Spark jobs. We have a quickstart
> guide for writing a standalone job, but it doesn't cover how to deal with
> packaging up your dependencies and setting the correct environment
> variables required to submit a full job to a cluster. This can be a
> confusing process for beginners - it would be good to extend the guide to
> cover this.
> First though I wanted to sample this list and see how people tend to run
> Spark jobs inside their orgs. Knowing any of the following would be helpful:
> - Do you create an uber jar with all of your job's (and Spark's) recursive dependencies?
> - Do you try to use sbt run or maven exec with some way to pass the correct
> environment variables?
> - Do people use a modified version of Spark's own `run` script?
> - Do you have some other way of submitting jobs?
> Any notes would be helpful in compiling this!
Now that Spark has been integrated into Bigtop, it may make sense
to tackle some of those issues from a
distribution perspective. Bigtop has the luxury of defining an
entire distribution (you always know what versions of Hadoop
and its ecosystem projects you're dealing with). It also
provides helper functionality for a lot of common things (like
finding JAVA_HOME, plugging into the underlying
OS capabilities, etc.).
I guess all I'm saying is that you guys should consider Bigtop
as an integration platform for making Spark easier to use.
Feel free to fork off this thread to dev@bigtop (CCed) if you
think this is an idea worth exploring.