

Re: How to get/operate the InputFileName in pig 0.8.1
Great. Following the wiki page
http://wiki.apache.org/pig/PigStorageWithInputPath and using
the setting -Dpig.noSplitCombination=true, I can now get the file name in
Pig.
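
For the original goal (turning the file name into an id column), the parsing step could look roughly like the sketch below. It is plain Java with no Pig dependency and assumes the names follow the file_<id>_<part> pattern from the original question. The class name FileNameId and the method extractId are hypothetical, and wiring the result into the LoadFunc from the wiki page is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: parses the id out of a file name.
final class FileNameId {
    // Assumed naming scheme from the thread: file_<id>_<part>, e.g. file_29_00001.
    // The id is limited to 9 digits so that it always fits in an int.
    private static final Pattern NAME = Pattern.compile("file_(\\d{1,9})_\\d+");

    /** Returns the id in the last path segment, or -1 if the name does not match. */
    static int extractId(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(name);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(extractId("hdfs://path/load/file_29_00001")); // prints 29
        System.out.println(extractId("file_47_00001"));                  // prints 47
        System.out.println(extractId("README"));                         // prints -1
    }
}
```

Returning -1 for a non-matching name is only one possible choice; a loader might prefer to emit null for that column instead.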

But I have another problem.
I modified the UDF code, rebuilt it with ant, and generated a new jar file (I
am sure the jar file has been updated). Then I ran:
pig -x local
register /home/user/project/lib/myUDF.jar
a = load 'aaa';
b = foreach a generate com.company.pig.myUDF();
dump b;

I found that the result still comes from the old jar file and UDF class, so I
think the UDF classes have been cached somewhere.

Am I right?
And how can I make Pig use the newest jar file after recompiling?
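
One way to check which jar a class really comes from is to print its code source location. The sketch below uses only standard JDK calls. The names WhereLoaded and locationOf are hypothetical, and it assumes the check runs in the same JVM that loads the UDF (for example as a temporary log line inside the UDF itself). It only diagnoses where the class was loaded from; it does not by itself explain why an old copy was picked up.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class WhereLoaded {
    /** Returns the jar or directory URL of the class, or a placeholder if unknown. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null) {
            // Classes from the JDK itself (bootstrap loader) have no code source.
            return "(bootstrap or unknown)";
        }
        URL url = src.getLocation();
        return url == null ? "(unknown)" : url.toString();
    }

    public static void main(String[] args) {
        // For a UDF, pass the UDF class instead, e.g. from inside its own code.
        System.out.println(locationOf(WhereLoaded.class));
        System.out.println(locationOf(String.class));
    }
}
```

If the printed location is not /home/user/project/lib/myUDF.jar, an older copy of the class is probably being found earlier on the classpath.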

Thanks very much.

2011/6/15 Daniel Dai <[EMAIL PROTECTED]>

>  Check http://wiki.apache.org/pig/PigStorageWithInputPath, also you will
> need to disable split combination: -Dpig.noSplitCombination=true
>
> Daniel
>
>
> On 06/13/2011 04:07 AM, Jameson Li wrote:
>
> Hi,
>
> I have some files in hdfs://path/load/ like this:
> file_29_00001
> file_47_00001
> file_16_00001
> ...
> These files are generated by other M/R jobs. Each file contains only one
> column, and the number in the file name between 'file_' and '_00001' is an
> id.
> I want to add the id to the loaded data like this (I think I need to
> write a LoadFunc to get the id):
> a = load '/path/load/' using com.company.pig.GetIDFromFileName();
> dump a;
> -- here the relation 'a' will have two columns: one is the original column
> -- and the other is the id.
>
> And my questions are these:
> 1, Is there an existing function that I can use to get the id from the file
> name?
> 2, I think the method in pig 0.6.0 can help me:
> bindTo(String fileName, BufferedPositionedInputStream in, long offset, long end)
>           Specifies a portion of an InputStream to read tuples.
> <http://pig.apache.org/docs/r0.6.0/api/org/apache/pig/builtin/PigStorage.html#bindTo(java.lang.String,org.apache.pig.impl.io.BufferedPositionedInputStream,long,long)>
> but I can't find the same method in pig 0.8.1.
> Which method can I use to work with the input file in the pig 0.8.1 API?
>
> Thanks very much.
>
>
>