RE: Load Resource File for UDF on Cluster
How about using the distributed cache?
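
For what it's worth, a minimal sketch of the cache route, assuming the
legacy MaxMind GeoIP Java API (com.maxmind.geoip.LookupService) and a
hypothetical HDFS path. In the Pig script (or via -D on the pig command
line, if your version doesn't support set) you'd pass:

set mapred.cache.files 'hdfs:///user/zaki/GeoIP.dat#GeoIP.dat';
set mapred.create.symlink 'yes';

and then the UDF can open the symlink like a local file:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

import com.maxmind.geoip.LookupService;

// Illustrative UDF: maps an IP address string to a country name.
public class GeoIpCountry extends EvalFunc<String> {
    private LookupService geo;

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        if (geo == null) {
            // "GeoIP.dat" is the symlink the distributed cache created
            // in the task's current working directory.
            geo = new LookupService("GeoIP.dat",
                                    LookupService.GEOIP_MEMORY_CACHE);
        }
        return geo.getCountry((String) input.get(0)).getName();
    }
}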

-----Original Message-----
From: Kevin Weil [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 01, 2009 12:00 PM
To: [EMAIL PROTECTED]
Subject: Re: Load Resource File for UDF on Cluster

This may be sacrilege, but for files like GeoIP.dat that you will
consistently want, another strategy is to make them part of your
datanode deployment/configuration.  Have puppet or whatever you use put
the GeoIP stuff in a common location on each datanode
(/usr/local/geoip/GeoIP.dat or something) and then load it locally in
your UDF.  The other benefit of this with GeoIP specifically is that it
allows you to update the data file without deploying a new jar, plus the
size of the jar that you're sending all over the cluster gets reduced
dramatically.
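
A minimal sketch of that lazy-load pattern, again assuming the MaxMind
LookupService and the illustrative path above; compared with the cache
approach, the only difference is where the file lives:

import java.io.IOException;
import com.maxmind.geoip.LookupService;

// Illustrative helper: opens the locally deployed database on first
// use and caches it for the lifetime of the task JVM.
public class LocalGeoDb {
    private static LookupService geo;

    public static synchronized LookupService get() throws IOException {
        if (geo == null) {
            geo = new LookupService("/usr/local/geoip/GeoIP.dat",
                                    LookupService.GEOIP_MEMORY_CACHE);
        }
        return geo;
    }
}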
Just a thought,
Kevin

On Thu, Oct 1, 2009 at 8:19 AM, zaki rahaman <[EMAIL PROTECTED]>
wrote:

> Hi All,
>
> So I'm running into an issue in trying to use a UDF I wrote to do
> GeoIP location on IP addresses in tuples. I thought I could simply
> pack the source/class files along with the resource file (GeoIP.dat)
> into a JAR and Pig would be able to use the UDF properly.
>
> The structure of the JAR is as follows:
>
> resources/GeoIP.dat
> mypigudfs/*.class
>
> In the relevant Java source file, I make the following reference to
> the resource file:
>
> String dbpath = getClass().getResource("/resources/GeoIP.dat").toExternalForm();
>
> I end up getting a File Not Found error since, for some reason, the
> file is not shipped to the cluster.
>
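
An aside on the quoted snippet: one possible cause of the error is that
toExternalForm() on a resource inside a jar yields a jar: URL (e.g.
jar:file:/...!/resources/GeoIP.dat), which a file-path-based reader
like LookupService can't open even when the registered jar does reach
the task classpath. If the jar is shipped, one pattern is to copy the
bundled resource to a local temp file first; a sketch, with a
hypothetical helper class name:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative helper: copies a resource bundled in the registered jar
// to a local temp file so a file-path-only API can open it.
public class ResourceToFile {
    public static File extract(String resourcePath) throws IOException {
        InputStream in = ResourceToFile.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("resource not on classpath: " + resourcePath);
        }
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        OutputStream out = new FileOutputStream(tmp);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
        return tmp;
    }
}

With that, new LookupService(ResourceToFile.extract("/resources/GeoIP.dat").getPath())
works wherever the jar itself is available to the task.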