Pig >> mail # user >> Managing pig script jar dependencies


+ Geoffrey Gallaway 2011-01-19, 22:24
+ Dmitriy Ryaboy 2011-01-21, 00:27
+ Kaluskar, Sanjay 2011-01-21, 01:41
+ Dmitriy Ryaboy 2011-01-21, 02:27
+ Kaluskar, Sanjay 2011-01-21, 03:31
+ Erik Onnen 2011-01-21, 06:44

Re: Managing pig script jar dependencies
In Oozie we run into a similar problem.

As workflows with Pig actions proliferate, the lib/ directory of each
workflow app has to contain Pig and its dependent JARs. This becomes a
nightmare to maintain as the number of workflow apps increases.

To solve this, we added to Oozie the concept of a sharelib/ directory
in HDFS.

You then copy into sharelib/ all the JARs you want to use across
multiple workflow applications.

When submitting a workflow you can specify the sharelib/ dir you want
to use, or you can tell Oozie to use the system sharelib/ (the default
one).

Oozie then adds all the JARs in the specified sharelib/ to the
distributed cache for the Pig job.
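As a concrete sketch of the layout (the paths and the job.properties
entries below are illustrative guesses, not necessarily the final
Oozie 2.3 names):

    # upload the shared JARs once into a sharelib/ directory in HDFS
    hadoop fs -mkdir /user/oozie/share/lib/pig
    hadoop fs -put pig-core.jar piggybank.jar /user/oozie/share/lib/pig/

    # a workflow's job.properties then points at it, e.g.:
    #   oozie.libpath=hdfs://namenode:8020/user/oozie/share/lib/pig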

The benefit of this approach is that the JAR files live only once in
HDFS and can be managed and updated globally. And users won't miss a
JAR by mistake.

This feature is coming in Oozie 2.3.

Pig could easily have a -sharelib option that points to an HDFS
sharelib/ directory, thus achieving the same effect.
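In the meantime, something close can be approximated per invocation; a
sketch, assuming the colon-separated pig.additional.jars property of
recent Pig releases (paths are placeholders):

    # ship extra JARs with a single Pig invocation
    pig -Dpig.additional.jars=/libs/my-udfs.jar:/libs/third-party.jar myscript.pig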

<ad>
BTW, as Oozie supports submitting Pig jobs directly, by doing 'oozie
pig -f ....' you get this feature for free. Plus, Oozie becomes a Pig
server (you get a job ID and can track progress later), all this
without having to write a workflow.
</ad>

Hope this helps.

Alejandro
On Fri, Jan 21, 2011 at 2:44 PM, Erik Onnen <[EMAIL PROTECTED]> wrote:

> As a new member to the list, I offer our lone data point. We use the maven
> shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/
>
> Shade produces an "uber" JAR with an optional declared main class.
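A minimal pom.xml sketch of such a shade setup (the mainClass value is
a placeholder):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- sets Main-Class in the uber JAR's manifest -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.example.Main</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>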
>
> On the up side, for a reasonable number of dependencies (in our case
> ~40), it just works and results in a single JAR. We're lucky enough
> that across the board, we can use one JAR for launching a message
> consumer, a Hadoop job, and a Pig job.
>
> That said, there are two caveats we've encountered:
> * System dependencies aren't rolled into the "uber" JAR - if you want
> something to be in the deployment artifact, you need to at a minimum put it
> into your local repo - we do this via bash scripting for HBase 0.90.0 for
> example.
> * Conflicts - so far we've managed to run a maven dependency:tree and
> exclude conflicting dependencies, but I'm sure there is a point where
> that will not work any more.
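For that second caveat, the usual pattern is to find where the clash
comes from and then exclude the conflicting transitive JAR; a sketch
with placeholder coordinates:

    # find where the conflicting JAR comes from
    mvn dependency:tree

    <!-- then, in pom.xml, exclude it from the offending dependency -->
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>some-library</artifactId>
      <version>1.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.example</groupId>
          <artifactId>conflicting-dep</artifactId>
        </exclusion>
      </exclusions>
    </dependency>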
>
> I'd love to hear how others are solving the problem; so far this has
> worked for us.
>
> -erik
>
>
> On Thu, Jan 20, 2011 at 7:31 PM, Kaluskar, Sanjay
> <[EMAIL PROTECTED]> wrote:
>
> > Hi Dmitriy,
> >
> > Well, what I have is still experimental & not in any product. But, yes
> > we can compile to a Pig script. I try to use the native relational
> > operators where possible & use UDFs in other cases.
> >
> > I don't understand which conflicts you are referring to. Initially, I
> > was trying to create a single jar (containing all the 300 dependencies)
> > using the maven-dependency-plugin (BTW that seems to be the recommended
> > approach & should work in many cases) but it turned out that some of our
> > internal components had conflicting file names for some of the resources
> > (should probably be fixed!). My current approach works better because I
> > don't try to re-package any dependency. Yes, startup times are slow - of
> > course, I am open to other ideas :-)
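The no-repackaging approach Sanjay describes typically means the
generated script registers each dependency explicitly; a minimal Pig
Latin sketch (paths and the UDF class are hypothetical):

    -- register dependencies instead of bundling them into one JAR
    REGISTER /libs/my-udfs.jar;
    REGISTER /libs/third-party.jar;

    -- hypothetical UDF from one of the registered JARs
    DEFINE MyUdf com.example.pig.MyUdf();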
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> > Sent: 21 January 2011 07:57
> > To: [EMAIL PROTECTED]
> > Subject: Re: Managing pig script jar dependencies
> >
> > Sanjay,
> > Informatica compiles to Pig now, eh? Interesting...
> > How do you handle jar conflicts if you bundle the whole lot? Doesn't
> > this cost you a lot on job startup time?
> >
> > Dmitriy
> >
> >
> > On Thu, Jan 20, 2011 at 5:41 PM, Kaluskar, Sanjay
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > I have a similar problem and I can tell you what I am doing currently,
> >
> > > just in case it is useful. I have a tool that generates PIG scripts
+ Dmitriy Lyubimov 2011-01-22, 01:04
+ Dmitriy Lyubimov 2011-01-22, 01:00