You can do this in Pig by setting the following -D switch on the command line
you use to invoke Pig.
In the 0.8 release you will be able to do this from within a Pig script, like this:
set pig.streaming.ship.files myTopLevel.jar;
Note that this is just to unblock you. It's an internal Pig property that is
not exposed to users and may break your script if you are also using
Streaming from within Pig. We need to find a long-term solution for your
particular use case.
Hope it helps,
On Wed, Aug 11, 2010 at 09:30, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> Moving to mapreduce-user@, bcc common-user@.
> Why do you need to create a single top-level jar? Just register each of
> your jars and put each in the distributed cache... however you have 150 jars
> which is a lot. Is there a way you can decrease that? I'm not sure how you do
> this in Pig, but in MR you have the ability to add a jar in the DC to the
> classpath of the child (DistributedCache.addFileToClassPath).
> Hope that helps.
> On Aug 11, 2010, at 12:48 AM, Kaluskar, Sanjay wrote:
>> I am using Hadoop indirectly through PIG, and some of the UDFs (defined
>> by me) need other jars at runtime (around 150) some of which have
>> conflicting resource names. Hence, trying to unpack all of them and
>> repacking into a single jar doesn't work. My solution is to create a
>> single top-level jar that names all the dependencies in Class-Path in
>> the MANIFEST.MF. This is also simpler from a user's point of view. Of
>> course this requires the top-level jar and all the dependencies to be
>> created with a certain directory structure that I can control.
>> Currently, I have a structure where I have a root directory which
>> contains the top-level jar and a directory called lib, and all the
>> dependencies are in lib, and the top-level jar names the dependencies as
>> lib/x.jar lib/y.jar etc. I package all of this as a single zip file for
>> easy installation.
>> Just to be clear this is the dir structure:
>> root dir
>> |--- top-level.jar
>> |--- lib
>>      |--- x.jar
>>      |--- y.jar
>> I can't register top-level.jar in my PIG script (this is the recommended
>> approach) because PIG then unpacks & repackages everything into a single
>> jar, instead of including the jar on the classpath. I can't use
>> distributed cache because if I specify top-level.jar and lib separately
>> in mapred.cache.files, then the relative directory locations aren't
>> preserved. If I use the mapred.cache.archives option and specify the zip
>> file, I can't add the top-level jar to the classpath (because the
>> entries in mapred.job.classpath.files must be something from
>> If mapred.child.java.opts also allowed java.class.path to be augmented
>> (similar to java.library.path, which I am using for native libs that I
>> store in another dir parallel to lib), it would have solved my problem.
>> I could have specified the zip in mapred.cache.archives, and added the
>> jar to the classpath. Right now I can't see any solution, other than
>> using a shared file system and adding top-level.jar to HADOOP_CLASSPATH
>> - this works because I am using a small cluster that has a shared file
>> system but clearly it's not always feasible (and of course, it's
>> modifying Hadoop's environment).
>> Please suggest any alternatives you can think of.
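For anyone following the thread, here is a minimal sketch of how a top-level jar like the one described in the quoted message could be generated: a jar that carries no classes of its own, only a MANIFEST.MF whose Class-Path names the dependencies relative to the jar's location. It uses only java.util.jar from the JDK. The class name TopLevelJarSketch and both method names are hypothetical, and lib/x.jar and lib/y.jar are just the example paths from the message. It does not address the Pig or distributed cache side of the problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Pig or Hadoop.
public class TopLevelJarSketch {

    // Builds a manifest whose Class-Path lists each dependency as a path
    // relative to the top-level jar; entries are separated by spaces.
    public static Manifest buildManifest(List<String> relativeJarPaths) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, or the main attributes are not written.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, String.join(" ", relativeJarPaths));
        return manifest;
    }

    // Serializes a jar that contains nothing but META-INF/MANIFEST.MF.
    public static byte[] writeManifestOnlyJar(Manifest manifest) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            // No further entries: the manifest is the only content.
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Manifest manifest = buildManifest(Arrays.asList("lib/x.jar", "lib/y.jar"));
        byte[] jar = writeManifestOnlyJar(manifest);
        // prints "Class-Path: lib/x.jar lib/y.jar"
        System.out.println("Class-Path: "
                + manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH));
        System.out.println("jar size in bytes: " + jar.length);
    }
}
```

The Class-Path entries are resolved relative to the jar that declares them, which is why the directory layout (top-level.jar next to lib) has to be preserved wherever the jar ends up.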