-
Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 15:19
Hi All,

I'm running into an issue using a UDF I wrote to do GeoIP lookups on IP addresses in tuples. I thought I could simply pack the source/class files along with the resource file (GeoIP.dat) into a JAR and Pig would be able to use the UDF properly.

The structure of the JAR is as follows:

    resources/GeoIP.dat
    mypigudfs/*.class

In the relevant Java source file, I make the following reference to the resource file:

    String dbpath = getClass().getResource("/resources/GeoIP.dat").toExternalForm();

I end up getting a File Not Found error, as for some reason the file is not shipped to the cluster.
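One possible contributor to the error (an assumption, not something confirmed in this thread): when a class is loaded from a JAR, `getResource(...).toExternalForm()` returns a `jar:file:...!/resources/GeoIP.dat` URL rather than a plain filesystem path, so code that opens that string as a file will not find it. A minimal sketch of a workaround is to read the resource as a stream and copy it to a temporary file. The class and method names below are hypothetical, and the sketch assumes a Java version with `java.nio.file` and try-with-resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Pig or of the original UDF.
final class ResourceExtractor {
    private ResourceExtractor() {}

    // Copies a classpath resource (for example "/resources/GeoIP.dat")
    // to a temporary file and returns that file's path.
    static Path extractToTempFile(Class<?> anchor, String resourcePath) throws IOException {
        try (InputStream in = anchor.getResourceAsStream(resourcePath)) {
            if (in == null) {
                // The resource is not visible to this class's class loader.
                throw new IOException("Resource not found on classpath: " + resourcePath);
            }
            Path tmp = Files.createTempFile("udf-resource-", ".dat");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}
```

Inside the UDF this might be called as `ResourceExtractor.extractToTempFile(getClass(), "/resources/GeoIP.dat").toString()`, with the resulting path handed to whatever lookup library needs a real file. It only helps if the JAR containing the resource actually reaches the task's classpath.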
-
Re: Load Resource File for UDF on Cluster
Kevin Weil 2009-10-01, 19:00
This may be sacrilege, but for files like GeoIP.dat that you will consistently want, another strategy is to make them part of your datanode deployment/configuration. Have Puppet or whatever you use put the GeoIP files in a common location on each datanode (/usr/local/geoip/GeoIP.dat or something) and then load the file locally in your UDF. The other benefit of this with GeoIP specifically is that it lets you update the data file without deploying a new jar, and the jar you're sending all over the cluster gets dramatically smaller.

Just a thought,
Kevin
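A minimal sketch of the "load it locally" side of this approach, assuming the file has already been deployed to the same path on every node. The default path is the one suggested in the message; the class name and the `geoip.dat.path` override property are hypothetical. Checking readability up front gives a clear error on a node where the deployment step was missed.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: resolves a data file that is deployed on every node.
final class LocalGeoIpFile {
    // Location suggested in the thread; adjust to your own deployment.
    static final String DEFAULT_PATH = "/usr/local/geoip/GeoIP.dat";
    // Hypothetical system property for overriding the location (e.g. in tests).
    static final String OVERRIDE_PROPERTY = "geoip.dat.path";

    private LocalGeoIpFile() {}

    // Returns the path of the data file, or fails with a message naming the path.
    static Path locate() throws FileNotFoundException {
        Path p = Paths.get(System.getProperty(OVERRIDE_PROPERTY, DEFAULT_PATH));
        if (!Files.isReadable(p)) {
            throw new FileNotFoundException("GeoIP data file not readable on this node: " + p);
        }
        return p;
    }
}
```

The UDF would call `LocalGeoIpFile.locate()` once, on first use, and pass the path to its lookup library.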
-
RE: Load Resource File for UDF on Cluster
Santhosh Srinivasan 2009-10-01, 19:06
How about using the distributed cache?
-
Re: Load Resource File for UDF on Cluster
Kevin Weil 2009-10-01, 19:13
This was kind of my point. You could definitely use the distributed cache for this kind of thing, but for a file like GeoIP.dat that is used regularly and consistently (and not changed frequently), you can make it part of the deployment and simplify life. The distributed cache is a great option too, of course.

Kevin
-
Re: Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 19:24
Thanks for the help.

I'm somewhat limited in that I'm using Pig via Elastic MapReduce for the time being while I wait to get a cluster deployed. How would I go about using the distributed cache to accomplish this? (I can ssh into my master node and launch Hadoop from there, if that helps.)

-- Zaki Rahaman
-
Re: Load Resource File for UDF on Cluster
Dmitriy Ryaboy 2009-10-01, 19:25
Like Kevin points out, you shouldn't need to push the whole GeoIP database around in the jar. You can make it part of your build or get it onto the nodes in some other way, or you could keep the file on HDFS and use the distributed cache facility that Pig makes available. It's bundled into defining streaming commands, so we have to do a bit of a head fake here:

    DEFINE lsa `ls -a` cache('/hdfs/path/to/GeoIP.dat#GeoIP.dat');

(DEFINE semantics make us assign some command to some alias, which we aren't really going to use.)

-D
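A minimal sketch of how a UDF might read the file on the task side. It assumes the file really does end up in the distributed cache and is symlinked into the task's working directory under the name after the `#` (here `GeoIP.dat`); whether the DEFINE trick above achieves that is not confirmed at this point in the thread. The class name is hypothetical. The contents are loaded lazily and kept for later calls, so the file is read at most once per instance.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper: lazily reads a file expected in the task working directory.
final class CachedFile {
    private final File file;
    private byte[] contents; // null until the first successful read

    CachedFile(File workingDir, String linkName) {
        this.file = new File(workingDir, linkName);
    }

    // Reads the file on first use and returns the same array on later calls.
    synchronized byte[] bytes() throws IOException {
        if (contents == null) {
            if (!file.isFile()) {
                throw new FileNotFoundException(
                        "Cached file not found in task working directory: " + file.getAbsolutePath());
            }
            contents = Files.readAllBytes(file.toPath());
        }
        return contents;
    }
}
```

In a UDF this could be held in a field as `new CachedFile(new File("."), "GeoIP.dat")`. Reading the whole file into memory is only reasonable for small files; for larger ones, pass the resolved path to the lookup library instead.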
-
Re: Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 19:26
Also, I'm only using the Country DB (~1MB), so jar size is not a huge problem. From my understanding, jars are auto-shipped only once to each node, so on a cluster of, say, 10 nodes, this isn't a huge deal. Or am I wrong?

-- Zaki Rahaman
-
Re: Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 19:29
Yeah, I thought about using a dummy shell script/bash command to trick Pig into caching the file, but I assumed the caching wouldn't take place until after I actually called the alias. If I understand you correctly, the DEFINE statement itself accomplishes this.

-- Zaki Rahaman
-
Re: Load Resource File for UDF on Cluster
Dmitriy Ryaboy 2009-10-01, 19:59
Hm, no, you are right. I just experimented with this, and caching only happens when the alias is first invoked.
-
Re: Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 20:19
How then do I go about getting the file to all nodes via DistributedCache?

If there isn't one already, there should be a general way to do this in Pig. I would be more than happy to help contribute towards this if someone files a patch on JIRA.

Also, I noticed in searching the archives that there's currently no way to cache files for a UDF jar in the DEFINE statement. This seems like important functionality to have (since it's available for streaming commands). Again, if pointed in the right direction, I can help with this.

-- Zaki Rahaman
-
Re: Load Resource File for UDF on Cluster
Jeff Hammerbacher 2009-10-01, 21:25
Hey Zaki,

Why not just use Hadoop on AWS, but not through the EMR interface? Then you'll have access to the DistributedCache.

Later,
Jeff
-
Re: Load Resource File for UDF on Cluster
zaki rahaman 2009-10-01, 21:52
Hey Jeff,

That's the plan, although I've still got some issues to iron out before I can proceed (I'm waiting for the final release of CDH2, and I've been too lazy to get my datasets onto an EBS volume; if you've got some tips, I'd love to talk offline).

So I guess there's basically no way to do this, even by starting an interactive session on EMR?

-- Zaki Rahaman