-
How to get/operate the InputFileName in pig 0.8.1
Jameson Li 2011-06-13, 11:07
Hi,

I have some files in hdfs://path/load/ like this:

    file_29_00001
    file_47_00001
    file_16_00001
    ...

These files are generated by other M/R jobs. Each file contains only one column, and the number in the file name between 'file_' and '_00001' is an id.

I want to add the id to each record when loading, like this (I think I need to write a LoadFunc to get the id):

    a = load '/path/load/' using com.company.pig.GetIDFromFileName();
    dump a;
    -- here the relation 'a' will have two columns: the original column and the id

My questions are:

1. Is there an existing function that can get the id from the file name?
2. I think this method in Pig 0.6.0 could help me (http://pig.apache.org/docs/r0.6.0/api/org/apache/pig/builtin/PigStorage.html):

    bindTo(String fileName, BufferedPositionedInputStream in, long offset, long end)
    Specifies a portion of an InputStream to read tuples.

But I can't find the same method in Pig 0.8.1. Which method can I use to work with the input file in the Pig 0.8.1 API?

Thanks very much.
-
Re: How to get/operate the InputFileName in pig 0.8.1
Daniel Dai 2011-06-14, 17:24
Check http://wiki.apache.org/pig/PigStorageWithInputPath. You will also need to disable split combination:

    -Dpig.noSplitCombination=true

Daniel
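
Assuming the loader from that wiki page gives you the input path for each record, the id still has to be cut out of the file name. The following is only a sketch in plain Java with no Pig dependency. The class and method names (FileNameId, extractId) are made up for illustration, and the pattern assumes the file_<id>_<sequence> naming described in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Pig. */
final class FileNameId {
    // Assumed naming scheme from the question: file_<id>_<sequence>, e.g. file_29_00001.
    private static final Pattern NAME = Pattern.compile("file_(\\d+)_\\d+");

    /** Returns the id embedded in the last path component, or empty if the name does not match. */
    static Optional<String> extractId(String path) {
        // Keep only the part after the last '/', so full HDFS paths work too.
        String name = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(name);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Prints 29
        System.out.println(extractId("hdfs://path/load/file_29_00001").orElse("no id"));
    }
}
```

A custom loader could call such a helper once per input file and append the result to every tuple it returns, rather than appending the whole path.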
-
Re: How to get/operate the InputFileName in pig 0.8.1
Jameson Li 2011-06-16, 13:09
Great. Following the wiki page http://wiki.apache.org/pig/PigStorageWithInputPath and setting -Dpig.noSplitCombination=true, I can get the file name in Pig.

But I have another problem. I modified the UDF code, rebuilt it with ant, and generated a new jar file (I am sure the jar file was updated):

    pig -x local
    register /home/user/project/lib/myUDF.jar
    a = load 'aaa';
    b = foreach a generate com.company.pig.myUDF();
    dump b;

The result still comes from the old jar file and UDF class, so I think the UDF classes are cached somewhere. Am I right? How can I make Pig use the newest jar file after recompiling?

Thanks very much.
-
Re: How to get/operate the InputFileName in pig 0.8.1
Daniel Dai 2011-06-16, 18:25
It should not be. Pig does not cache myUDF.jar. Every run will pick up myUDF.jar again from /home/user/project/lib.

Daniel
-
Re: How to get/operate the InputFileName in pig 0.8.1
Jameson Li 2011-06-17, 02:05
Sorry, this was my mistake.

My newest jar file is /home/user/project/lib/myUDF.jar, but there is an old jar file in the Pig lib directory $PIG_HOME/lib (/opt/pig/lib). Unfortunately, even after registering /home/user/project/lib/myUDF.jar, when the Pig code is executed it first scans the jar files in the Pig lib directory for the UDF classes.
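
When two copies of a class are on the classpath like this, one way to see which one wins is to ask the JVM where it loaded the class from. The sketch below uses only standard JDK calls; the class name WhichJar is made up for illustration. It is only meaningful if it is run with the same classpath that Pig uses, which is an assumption you would have to arrange yourself.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic for illustration; not part of Pig. */
final class WhichJar {
    /** Returns where the JVM loaded the class from, or "(bootstrap)" for core JDK classes. */
    static String locationOf(Class<?> cls) {
        // Core JDK classes have no code source, so guard against null.
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, for example the UDF class; defaults to this class.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : WhichJar.class;
        System.out.println(cls.getName() + " <- " + locationOf(cls));
    }
}
```

If the printed location is the old jar under the Pig lib directory rather than the freshly built one, removing or replacing the stale copy is the simplest fix.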
-
Re: How to get/operate the InputFileName in pig 0.8.1
Jameson Li 2011-06-17, 09:46
Another question:

The class org.apache.pig.piggybank.storage.MultiStorage can help me store the Pig output into different directories, but I want the output files not to contain the 'splitFieldIndex' field. For example:

Input file:

    id  name
    --------
    1   jack
    1   tom
    1   lily
    2   cat
    2   pig
    2   bird

After using MultiStorage('/my/home/output', '0', 'bz2', '\\t'), I normally get the following files and contents:

    1/1-0
    ------
    1   jack
    1   tom
    1   lily

    2/2-0
    ------
    2   cat
    2   pig
    2   bird

I want to get these files and contents instead:

    1/1-0
    ------
    jack
    tom
    lily

    2/2-0
    ------
    cat
    pig
    bird

Is there a switch that controls whether the stored file contains the 'splitFieldIndex' field? I have looked at the code and the answer seems to be no. Maybe I have to write another class, such as MultiStorageSwithWriteKey, that extends the class MultiStorageSwithKey. Am I right?

Thanks very much.
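
Whatever class ends up doing the writing, the core step is dropping the split field from each row before it is serialized. The following is a sketch of that step only, in plain Java with no Pig or piggybank dependency. The names DropSplitField and withoutField are made up for illustration, and a real storage class would work on Pig tuples rather than lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Pig or piggybank. */
final class DropSplitField {
    /** Returns a copy of the row without the field at splitFieldIndex. */
    static List<String> withoutField(List<String> fields, int splitFieldIndex) {
        if (splitFieldIndex < 0 || splitFieldIndex >= fields.size()) {
            throw new IndexOutOfBoundsException("splitFieldIndex " + splitFieldIndex);
        }
        List<String> out = new ArrayList<>(fields); // copy so the caller's row is untouched
        out.remove(splitFieldIndex);                // remove(int) removes by position
        return out;
    }

    public static void main(String[] args) {
        // Row "1<TAB>jack" with split field 0; prints jack
        System.out.println(String.join("\t", withoutField(Arrays.asList("1", "jack"), 0)));
    }
}
```

The split field value is still needed to choose the output directory, so it would have to be read before the row is trimmed.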