Pig >> mail # user >> Managing pig script jar dependencies

Geoffrey Gallaway 2011-01-19, 22:24
Dmitriy Ryaboy 2011-01-21, 00:27
Kaluskar, Sanjay 2011-01-21, 01:41
Dmitriy Ryaboy 2011-01-21, 02:27
Kaluskar, Sanjay 2011-01-21, 03:31
Erik Onnen 2011-01-21, 06:44
Re: Managing pig script jar dependencies
In Oozie we ran into a similar problem.

As workflows with Pig actions proliferated, the lib/ directory of each
workflow app had to contain Pig and its dependent JARs. This becomes a
nightmare to maintain as the number of workflow apps increases.

The approach to solve this was to add to Oozie the concept of a sharelib/
directory in HDFS.

You then copy into sharelib/ all the JARs you want to use across multiple
workflow applications.

When submitting a workflow you can specify the sharelib/ dir you want to use,
or you can tell Oozie to use the system sharelib/ (the default one).

Oozie then adds all the JARs in the specified sharelib/ to the distributed
cache for the Pig job.
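
The effect of combining a workflow's own lib/ with a shared sharelib/ can be
sketched as a toy model. This is not Oozie's code: the function name, the
paths, and the rule that a workflow-local JAR overrides a sharelib JAR with
the same file name are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def resolve_job_jars(workflow_lib, sharelib):
    """Return the HDFS JAR paths a job would ship, one per file name.

    Assumption for this sketch: a JAR in the workflow's own lib/ wins over
    a sharelib/ JAR that has the same file name.
    """
    chosen = {}
    # Visit sharelib entries first so workflow-local entries overwrite them.
    for path in list(sharelib) + list(workflow_lib):
        chosen[PurePosixPath(path).name] = path
    return sorted(chosen.values())


# Hypothetical HDFS paths, for illustration only.
sharelib = [
    "/user/oozie/sharelib/pig.jar",
    "/user/oozie/sharelib/jline.jar",
]
workflow_lib = ["/apps/wf-a/lib/my-udfs.jar"]

jars = resolve_job_jars(workflow_lib, sharelib)
print(jars)
```

With a shared directory, each workflow app only needs to carry the JARs that
are specific to it, such as its own UDFs.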

The benefit of this approach is that JAR files are stored only once in HDFS
and can be managed and updated globally. And users won't miss a JAR by

This feature is coming in Oozie 2.3.

Pig could easily have a -sharelib option that points to an HDFS sharelib/
directory, thus achieving the same result.

BTW, since Oozie supports submitting Pig jobs directly (by running 'oozie pig
-f ....'), you get this feature for free. Oozie also acts as a Pig server
(you get a job ID and can track progress later), all without having to write
a workflow.

Hope this helps.

On Fri, Jan 21, 2011 at 2:44 PM, Erik Onnen <[EMAIL PROTECTED]> wrote:

> As a new member to the list, I offer our lone data point. We use the maven
> shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/
> Shade produces an "uber" JAR with an optional declared main class.
> On the up side, for a
> reasonable number of dependencies (in our case ~40), it just works and
> results in a single JAR. We're lucky enough that across the board, we can
> use one JAR for launching a message consumer, a Hadoop job, and a Pig job.
> That said, there are two caveats we've encountered:
> * System dependencies aren't rolled into the "uber" JAR - if you want
> something to be in the deployment artifact, you need to at a minimum put it
> into your local repo - we do this via bash scripting for HBase 0.90.0 for
> example.
> * Conflicts - so far we've managed to do a maven dependency:tree and
> exclude conflicting dependencies, but I'm sure there is a point where
> that will not work any more.
> I'd love to hear how others are solving the problem, so far this has worked
> for us.
> -erik
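
Conflicts of the kind described above, where two dependencies ship the same
file, can be spotted before everything is merged into a single JAR. The
following is a minimal sketch using only Python's standard zipfile module;
the function name is hypothetical, and skipping META-INF/ entries is an
assumption made because nearly every JAR contains them.

```python
import zipfile
from collections import defaultdict


def find_conflicts(jar_paths):
    """Map each entry name found in more than one JAR to the JARs holding it."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                # Skip directory entries and META-INF/ (assumed uninteresting).
                if name.endswith("/") or name.startswith("META-INF/"):
                    continue
                # Record each JAR at most once per entry name.
                if jar not in owners[name]:
                    owners[name].append(jar)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}
```

Calling find_conflicts with the list of dependency JAR paths returns a
dictionary whose keys are the overlapping class or resource names, which
gives a starting point for deciding what to exclude.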
> On Thu, Jan 20, 2011 at 7:31 PM, Kaluskar, Sanjay wrote:
> > Hi Dmitriy,
> >
> > Well, what I have is still experimental & not in any product. But, yes
> > we can compile to a Pig script. I try to use the native relational
> > operators where possible & use UDFs in other cases.
> >
> > I don't understand which conflicts you are referring to. Initially, I
> > was trying to create a single jar (containing all the 300 dependencies)
> > using the maven-dependency-plugin (BTW that seems to be the recommended
> > approach & should work in many cases) but it turned out that some of our
> > internal components had conflicting file names for some of the resources
> > (should probably be fixed!). My current approach works better because I
> > don't try to re-package any dependency. Yes, startup times are slow - of
> > course, I am open to other ideas :-)
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> > Sent: 21 January 2011 07:57
> > Subject: Re: Managing pig script jar dependencies
> >
> > Sanjay,
> > Informatica compiles to Pig now, eh? Interesting...
> > How do you handle jar conflicts if you bundle the whole lot? Doesn't
> > this cost you a lot on job startup time?
> >
> > Dmitriy
> >
> >
> > On Thu, Jan 20, 2011 at 5:41 PM, Kaluskar, Sanjay wrote:
> >
> > > I have a similar problem and I can tell you what I am doing currently,
> > > just in case it is useful. I have a tool that generates PIG scripts
Dmitriy Lyubimov 2011-01-22, 01:04
Dmitriy Lyubimov 2011-01-22, 01:00