Re: Managing pig script jar dependencies
Alejandro Abdelnur 2011-01-21, 07:23
In Oozie we ran into a similar problem.
As workflows with Pig actions proliferated, the lib/ directory of each
workflow app had to contain Pig and its dependent JARs. This becomes a
nightmare to maintain as the number of workflow apps increases.
Our solution was to add to Oozie the concept of a sharelib/ directory in
HDFS.
You then copy to the sharelib/ all the JARs you want to use across multiple
workflow apps.
When submitting a workflow you can specify the sharelib/ dir you want to use,
or you can tell Oozie to use the system sharelib/ (the default one).
Oozie then adds all the JARs in the specified sharelib/ to the distributed
cache for the Pig job.
The benefit of this approach is that JAR files are stored only once in HDFS and
they can be managed and updated globally. And users won't miss a JAR by
This feature is coming in Oozie 2.3
Pig could easily have a -sharelib option that points to an HDFS sharelib/
directory thus achieving the same.
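
Until something like that exists, the idea can be approximated outside Pig.
The following is only a rough sketch, not anything Pig or Oozie ships: it
lists the JARs in a shared directory and prepends one REGISTER statement per
JAR to a script. It assumes a local directory as a stand-in for the HDFS
sharelib/, and the helper names register_lines and with_sharelib are made up
for illustration.

```python
import tempfile
from pathlib import Path


def register_lines(sharelib_dir):
    """Return one Pig REGISTER statement per *.jar file in sharelib_dir.

    Assumption: sharelib_dir is a local directory standing in for the
    HDFS sharelib/ described above. Non-JAR files are ignored.
    """
    jars = sorted(Path(sharelib_dir).glob("*.jar"))
    return [f"REGISTER {jar.as_posix()};" for jar in jars]


def with_sharelib(script_text, sharelib_dir):
    """Prepend the REGISTER statements to the text of a Pig script."""
    return "\n".join(register_lines(sharelib_dir) + [script_text])


if __name__ == "__main__":
    # Demo with a throwaway directory and empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b-udfs.jar", "a-common.jar", "notes.txt"):
            (Path(tmp) / name).touch()
        print(with_sharelib("A = LOAD 'input' AS (line:chararray);", tmp))
```

The JARs are registered in sorted order so the generated script is the same
on every run; a real implementation would list the directory through the
HDFS client instead of the local filesystem.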
BTW, since Oozie supports submitting Pig jobs directly ('oozie pig -f
....'), you get this feature for free. Oozie also then acts as a Pig
server (you get a job ID and can track progress later), all without
having to write a workflow.
Hope this helps.
On Fri, Jan 21, 2011 at 2:44 PM, Erik Onnen <[EMAIL PROTECTED]> wrote:
> As a new member to the list, I offer our lone data point. We use the maven
> shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/
> Shade produces an "uber" JAR with an optional declared main class.
> On the up side, for a
> reasonable number of dependencies (in our case ~40), it just works and
> results in a single JAR. We're lucky enough that across the board, we can
> use one JAR for launching a message consumer, a Hadoop job, and a Pig job.
> That said, there are
> two caveats we've encountered:
> * System dependencies aren't rolled into the "uber" JAR - if you want
> something to be in the deployment artifact, you need to at a minimum put it
> into your local repo - we do this via bash scripting for HBase 0.90.0 for
> * Conflicts - so far we've managed to do a maven dependency:tree and
> conflicting dependencies, but I'm sure there is a point where that will not
> work any more.
> I'd love to hear how others are solving the problem, so far this has worked
> for us.
> On Thu, Jan 20, 2011 at 7:31 PM, Kaluskar, Sanjay <
> [EMAIL PROTECTED]
> > wrote:
> > Hi Dmitriy,
> > Well, what I have is still experimental & not in any product. But, yes
> > we can compile to a Pig script. I try to use the native relational
> > operators where possible & use UDFs in other cases.
> > I don't understand which conflicts you are referring to. Initially, I
> > was trying to create a single jar (containing all the 300 dependencies)
> > using the maven-dependency-plugin (BTW that seems to be the recommended
> > approach & should work in many cases) but it turned out that some of our
> > internal components had conflicting file names for some of the resources
> > (should probably be fixed!). My current approach works better because I
> > don't try to re-package any dependency. Yes, startup times are slow - of
> > course, I am open to other ideas :-)
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> > Sent: 21 January 2011 07:57
> > To: [EMAIL PROTECTED]
> > Subject: Re: Managing pig script jar dependencies
> > Sanjay,
> > Informatica compiles to Pig now, eh? Interesting...
> > How do you handle jar conflicts if you bundle the whole lot? Doesn't
> > this cost you a lot on job startup time?
> > Dmitriy
> > On Thu, Jan 20, 2011 at 5:41 PM, Kaluskar, Sanjay
> > <[EMAIL PROTECTED]
> > > wrote:
> > > I have a similar problem and I can tell you what I am doing currently,
> > > just in case it is useful. I have a tool that generates PIG scripts