LOAD multiple files with glob
Bart Verwilst 2012-11-23, 20:45
Hello,

I have the following files on HDFS:

-rw-r--r-- 3 hdfs supergroup 22989179 2012-11-22 11:17 /data/2012/trace_ejb3/2012-01-01.avro
-rw-r--r-- 3 hdfs supergroup 240551819 2012-11-22 14:27 /data/2012/trace_ejb3/2012-01-02.avro
-rw-r--r-- 3 hdfs supergroup 324464635 2012-11-22 18:28 /data/2012/trace_ejb3/2012-01-03.avro
-rw-r--r-- 3 hdfs supergroup 345526418 2012-11-22 21:30 /data/2012/trace_ejb3/2012-01-04.avro
-rw-r--r-- 3 hdfs supergroup 351322916 2012-11-23 00:28 /data/2012/trace_ejb3/2012-01-05.avro
-rw-r--r-- 3 hdfs supergroup 325953043 2012-11-23 04:32 /data/2012/trace_ejb3/2012-01-06.avro
-rw-r--r-- 3 hdfs supergroup 107019156 2012-11-23 05:58 /data/2012/trace_ejb3/2012-01-07.avro
-rw-r--r-- 3 hdfs supergroup 46392850 2012-11-23 06:37 /data/2012/trace_ejb3/2012-01-08.avro
-rw-r--r-- 3 hdfs supergroup 361970930 2012-11-23 10:06 /data/2012/trace_ejb3/2012-01-09.avro
-rw-r--r-- 3 hdfs supergroup 398462505 2012-11-23 13:44 /data/2012/trace_ejb3/2012-01-10.avro
-rw-r--r-- 3 hdfs supergroup 400785976 2012-11-23 17:17 /data/2012/trace_ejb3/2012-01-11.avro
-rw-r--r-- 3 hdfs supergroup 400027565 2012-11-23 20:43 /data/2012/trace_ejb3/2012-01-12.avro

Using Pig 0.10.0-cdh4.1.2, I try to load those files and describe them.

REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = load '/data/2012/trace_ejb3/2012-01-01.avro' USING AvroStorage();

describe avro;

This works, same with 2012-01-02.avro.

However, as soon as I want to include multiple files, no dice.

avro = load '/data/2012/trace_ejb3/2012-01-{01,02}.avro' USING AvroStorage();
gives me:
2012-11-23 21:41:07,475 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
Caused by: java.net.URISyntaxException: Illegal character in path at index 30: /data/2012/trace_ejb3/2012-01-{01,02}.avro

avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
gives me:
Schema for avro unknown.

avro = load '/data/2012/trace_ejb3/2012-01-0[12].avro' USING AvroStorage();
also gives me:
Caused by: java.net.URISyntaxException: Illegal character in path at index 31: /data/2012/trace_ejb3/2012-01-0[12].avro

What am I doing wrong here? According to http://hadoop.apache.org/docs/r0.21.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29, this should all be acceptable input?

Thanks in advance!

Kind regards,

Bart
-
Re: LOAD multiple files with glob
Deepak Tiwari 2012-11-23, 23:41
Hi,

I don't have a system to test this on right now, but I have been passing the path as a parameter with -p, and it works.

Change the line to accept a parameter, like:

avro = load '$INPUT' USING AvroStorage();

bin/pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" <scriptName>

I think that if you don't use double quotes, the expansion is done by the shell.

Please let us know if it doesn't work...
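A note on the quoting point above: whether a glob reaches Pig intact depends on the local shell. The following is a minimal sketch, assuming a POSIX-style shell with default globbing options. The scratch directory and the count_args helper are hypothetical and exist only for this illustration. It shows that an unquoted pattern is expanded by the shell when local files match, while a quoted pattern is passed through as a single literal argument.

```shell
# Hypothetical helper: report how many arguments the shell actually passed.
count_args() { echo "$#"; }

# Scratch directory with two local files whose names match the pattern.
tmpdir=$(mktemp -d)
touch "$tmpdir/2012-01-01.avro" "$tmpdir/2012-01-02.avro"

# Unquoted pattern: the shell expands it to the two matching file names.
unquoted=$(count_args "$tmpdir"/2012-01-0[12].avro)

# Quoted pattern: passed through literally as a single argument.
quoted=$(count_args "$tmpdir/2012-01-0[12].avro")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -rf "$tmpdir"
```

In bash with default options, an unquoted pattern that matches no local file is also passed through unchanged, so for an HDFS-only path the quoting style alone would not be expected to change what Pig receives.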
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-24, 13:15
Hello,

Thanks for your suggestion!
I switched my avro variable to avro = load '$INPUT' USING AvroStorage();

However, I get the same results this way:

$ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
<snip>
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}

$ pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" avro-test.pig
which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
<snip>
2012-11-24 14:11:17,309 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
Caused by: java.net.URISyntaxException: Illegal character in path at index 31: /data/2012/trace_ejb3/2012-01-0[12].avro

$ pig -p INPUT='/data/2012/trace_ejb3/2012-01-0[12].avro' avro-test.pig
which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
<snip>
2012-11-24 14:12:05,085 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
Details at logfile: /var/lib/hadoop-hdfs/pig_1353762722742.log
Caused by: java.net.URISyntaxException: Illegal character in path at index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
-
Re: LOAD multiple files with glob
Russell Jurney 2012-11-24, 19:23
I suspect the problem is AvroStorage, not globbing. Try this with PigStorage.

Russell Jurney twitter.com/rjurney
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-25, 11:02
Hi,

I've tried loading a CSV with PigStorage(), getting this:

txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
describe txt;

Schema for txt unknown.

Maybe this is because it is a CSV, so a schema is hard to figure out..

Any other suggestions? Our whole Hadoop setup is built around being able to selectively load Avro files to run our jobs on; if this doesn't work then we're pretty much screwed.. :)

Thanks in advance!

Bart
-
Re: LOAD multiple files with glob
Cheolsoo Park 2012-11-25, 14:33
Hi Bart,

> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
> gives me:
> Schema for avro unknown.

This should work. The error that you're getting is not from AvroStorage but from PigServer.

grep -r "Schema for .* unknown" *
src/org/apache/pig/PigServer.java: System.out.println("Schema for " + alias + " unknown.");
...

It looks like you have an error in your Pig script. Can you please provide your Pig script and the schema of your Avro files that reproduce the error?

Thanks,
Cheolsoo
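To gather the per-file schemas requested above, one option is to run the parameterized script once per file. This is a minimal sketch, assuming the avro-test.pig script with load '$INPUT' from earlier in the thread. describe_commands is a hypothetical helper, the list of days is illustrative, and the function only prints the commands (a dry run) rather than invoking Pig.

```shell
# Hypothetical helper: print one pig invocation per daily file.
# Dry run only: nothing is executed here. Pipe the output to sh to run it.
describe_commands() {
  for day in "$@"; do
    printf 'pig -p INPUT=/data/2012/trace_ejb3/2012-01-%s.avro avro-test.pig\n' "$day"
  done
}

# Build the command list for the first three days of the month.
commands=$(describe_commands 01 02 03)
echo "$commands"
```

Comparing the describe output of each run would show whether all files share the same schema before a glob is attempted.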
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-25, 20:25
Just tried this:

----------------------------------------------------
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();

groups = group avro by tracetype;

dump groups;
----------------------------------------------------

gave me:

<file avro-test.pig, line 10, column 23> Invalid field projection. Projected field [tracetype] does not exist.

Pig Stack Trace
---------------
ERROR 1025:
<file avro-test.pig, line 10, column 23> Invalid field projection. Projected field [tracetype] does not exist.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
    at org.apache.pig.PigServer.openIterator(PigServer.java:862)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:555)
    at org.apache.pig.Main.main(Main.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias groups
    at org.apache.pig.PigServer.storeEx(PigServer.java:961)
    at org.apache.pig.PigServer.store(PigServer.java:924)
    at org.apache.pig.PigServer.openIterator(PigServer.java:837)
    ... 12 more
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 1025:
<file avro-test.pig, line 10, column 23> Invalid field projection. Projected field [tracetype] does not exist.
    at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:183)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:166)
    at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:101)
    at org.apache.pig.newplan.logical.relational.LOCogroup.accept(LOCogroup.java:235)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1621)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1616)
    at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1339)
    at org.apache.pig.PigServer.storeEx(PigServer.java:956)
    ... 14 more
===============================================================================

Maybe globbing with [] doesn't work, but wildcard works? No idea why I get the error above though..

Kind regards,

Bart
-
Re: LOAD multiple files with glob
Cheolsoo Park 2012-11-26, 09:45
Hi,

>> Invalid field projection. Projected field [tracetype] does not exist.

The error indicates that "tracetype" doesn't exist in the Pig schema of the relation "avro". What AvroStorage does is automatically convert the Avro schema to a Pig schema during the load. Although you have "tracetype" in your Avro schema, "tracetype" doesn't exist in the generated Pig schema for whatever reason.

Can you please try to "describe avro"? You can replace the group and dump commands with describe in your Pig script. This will show you what the Pig schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find out why it doesn't. It could be because the schemas of the .avro files are not the same, or because there is a bug in AvroStorage, etc.

>> Maybe globbing with [] doesnt work, but wildcard works?

You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop path globbing doesn't support '[ ]'. But the above error (Projected field [tracetype] does not exist) is not because of this. URISyntaxException is what you will get because of '[ ]'.

Thanks,
Cheolsoo
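The index reported by java.net.URISyntaxException is the zero-based offset of the offending character in the path string. A small sketch, assuming a POSIX-style shell, can confirm that the offsets 30 and 31 from the two error messages in this thread land on '{' and '[' respectively. first_index is a hypothetical helper written for this illustration, not part of Pig or Hadoop.

```shell
# Hypothetical helper: print the zero-based offset of the first
# occurrence of a character ($2) in a path ($1).
first_index() {
  path=$1
  char=$2
  # Strip everything from the first occurrence of $char onward;
  # quoting "$char" makes the shell treat it literally, not as a pattern.
  prefix=${path%%"$char"*}
  # The length of the remaining prefix is the offset of that character.
  echo "${#prefix}"
}

brace_idx=$(first_index '/data/2012/trace_ejb3/2012-01-{01,02}.avro' '{')
bracket_idx=$(first_index '/data/2012/trace_ejb3/2012-01-0[12].avro' '[')
echo "brace at $brace_idx, bracket at $bracket_idx"
```

Both offsets point at the first glob metacharacter, which is consistent with the path being rejected while it is parsed as a URI, before any glob matching against HDFS takes place.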
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 13:19
14:16:08 centos6-hadoop-hishiru ~ $ cat avro-test.pig
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/test/*' USING AvroStorage();
describe avro;

14:16:09 centos6-hadoop-hishiru ~ $ pig avro-test.pig
Schema for avro unknown.

14:16:17 centos6-hadoop-hishiru ~ $ vim avro-test.pig

14:16:25 centos6-hadoop-hishiru ~ $ cat avro-test.pig
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/test/2012-11-25.avro' USING AvroStorage();
describe avro;

14:16:30 centos6-hadoop-hishiru ~ $ pig avro-test.pig
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}

14:16:55 centos6-hadoop-hishiru ~ $ hadoop fs -ls /test/
Found 1 items
-rw-r--r-- 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 14:33
To answer myself, could this be part of the solution? :
https://issues.apache.org/jira/browse/PIG-2492

Guess I'll have to wait for 0.11 then?
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 15:50
To answer myself again, I compiled Pig 0.11 and Piggybank, and it's
working very well now, globbing seems to be fully supported!
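For reference, the two glob forms tried in this thread select different subsets of the twelve daily files listed in the first message. The following is a small Python sketch of that selection. It is an illustration only: `expand_braces` and `select` are hypothetical helpers, and the matching uses Python's `fnmatch`, not Hadoop's or Pig's glob implementation, so it says nothing about which patterns a given AvroStorage version accepts.

```python
import fnmatch
import re

# File names taken from the HDFS listing in the first message of this thread.
FILES = ["2012-01-%02d.avro" % day for day in range(1, 13)]


def expand_braces(pattern):
    """Expand {a,b,...} groups (hypothetical helper) into plain patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that patterns with several brace groups are handled too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def select(pattern, names=FILES):
    """Return the names matched by a glob that may contain {..} and *."""
    plain = expand_braces(pattern)
    return [n for n in names if any(fnmatch.fnmatchcase(n, p) for p in plain)]


if __name__ == "__main__":
    for pat in ("2012-01-{01,02}.avro", "2012-01-0*.avro", "2012-01-*.avro"):
        print(pat, "->", len(select(pat)), "files")
```

Running the same patterns against an `hadoop fs -ls` listing before handing them to LOAD is a cheap way to confirm that the glob itself selects the intended files, separately from any loader problem.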
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 12:48
Hi Cheolsoo,
Describe shows me:

avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}

(tracetype is there.)

So tracetype should work. I also tried avro.tracetype and trace.tracetype, but that didn't help.

Still, I've gotten us a bit sidetracked by this, since the original issue was that with wildcard globbing I get "Schema for avro unknown." :)

Kind regards,

Bart

Cheolsoo Park wrote on 26.11.2012 10:45:
> Hi,
>
>>> Invalid field projection. Projected field [tracetype] does not exist.
>
> The error indicates that "tracetype" doesn't exist in the Pig schema of
> the relation "avro". What AvroStorage does is automatically convert the
> Avro schema to a Pig schema during the load. Although you have "tracetype"
> in your Avro schema, "tracetype" doesn't exist in the generated Pig schema
> for whatever reason.
>
> Can you please try to "describe avro"? You can replace the group and dump
> commands with describe in your Pig script. This will show you what the Pig
> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
> out why it doesn't. It could be because the schema of the .avro files is
> not the same or because there is a bug in AvroStorage, etc.
>
>>> Maybe globbing with [] doesnt work, but wildcard works?
>
> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
> path globbing doesn't support '[ ]'. But the above error (Projected field
> [tracetype] does not exist) is not because of this. URISyntaxException is
> what you will get because of '[ ]'.
>
> Thanks,
> Cheolsoo
>
> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>
>> Just tried this:
>>
>> ----------------------------------------------------
>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>> REGISTER 'hdfs:///lib/piggybank.jar';
>>
>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>
>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>
>> groups = group avro by tracetype;
>>
>> dump groups;
>> ----------------------------------------------------
>>
>> gave me:
>>
>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>> Projected field [tracetype] does not exist.
>>
>> Pig Stack Trace
>> ---------------
>> ERROR 1025:
>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>> Projected field [tracetype] does not exist.
>>
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
>>     at org.apache.pig.PigServer.openIterator(PigServer.java:862)
>>     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>     at org.apache.pig.Main.run(Main.java:555)
>>     at org.apache.pig.Main.main(Main.java:111)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias groups
>>     at org.apache.pig.PigServer.storeEx(PigServer.java:961)
-
Re: LOAD multiple files with glob
Bart Verwilst 2012-11-25, 20:14
Hello,
The schema is displayed by describe when I run it like this:

--------------------------------------------
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/data/2012/trace_ejb3/2012-01-02.avro' USING AvroStorage();
describe avro;
--------------------------------------------

$ pig avro-test.pig
<snip>
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}

21:08:46 centos6-hadoop-hishiru ~ $

The actual schema as used by Python to create those files is:

{
  "type": "record",
  "name": "trace",
  "namespace": "asp",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "timestamp", "type": "long" },
    { "name": "latitude", "type": ["int","null"] },
    { "name": "longitude", "type": ["int","null"] },
    { "name": "speed", "type": ["int","null"] },
    { "name": "heading", "type": ["int","null"] },
    { "name": "terminalid", "type": "int" },
    { "name": "customerid", "type": "string" },
    { "name": "mileage", "type": ["int","null"] },
    { "name": "creationtime", "type": "long" },
    { "name": "tracetype", "type": "int" },
    { "name": "traceproperties", "type": {
        "type": "array",
        "items": {
          "name": "traceproperty",
          "type": "record",
          "fields": [
            { "name": "id", "type": "long" },
            { "name": "value", "type": "string" },
            { "name": "pkey", "type": "string" }
          ]
        }
      }
    }
  ]
}

Thanks!

Kind regards,

Bart

Cheolsoo Park wrote on 25.11.2012 15:33:
> Hi Bart,
>
> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
> gives me:
> Schema for avro unknown.
>
> This should work. The error that you're getting is not from AvroStorage
> but from PigServer.
>
> grep -r "Schema for .* unknown" *
> src/org/apache/pig/PigServer.java: System.out.println("Schema for " + alias + " unknown.");
> ...
>
> It looks like you have an error in your Pig script. Can you please
> provide your Pig script and the schema of your avro files that reproduce
> the error?
>
> Thanks,
> Cheolsoo
>
> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I've tried loading a csv with PigStorage(), getting this:
>>
>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
>> describe txt;
>>
>> Schema for txt unknown.
>>
>> Maybe this is because of it being a csv, so a schema is hard to figure
>> out..
>>
>> Any other suggestions? Our whole hadoop setup is built around being able
>> to selectively load avro files to run our jobs on, if this doesn't work
>> then we're pretty much screwed.. :)
>>
>> Thanks in advance!
>>
>> Bart
>>
>> Russell Jurney wrote on 24.11.2012 20:23:
>>
>>> I suspect the problem is AvroStorage, not globbing. Try this with
>>> pigstorage.
>>>
>>> Russell Jurney twitter.com/rjurney
>>>
>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for your suggestion!
>>>> I switched my avro variable to avro = load '$INPUT' USING AvroStorage();
>>>>
>>>> However I get the same results this way:
>>>>
>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>> <snip>
>>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}
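The relationship between the Avro schema above and the `describe` output can be sketched in Python. This is a toy mapping written for this thread, not AvroStorage's actual conversion code: the names `to_pig` and `describe` are hypothetical, and the rules (drop "null" from a two-branch union, string becomes chararray, an array becomes a bag labelled ARRAY_ELEM) are inferred only from the schema and the describe output quoted in this message.

```python
import json

# Avro schema exactly as posted in this message.
AVRO_SCHEMA = json.loads("""
{ "type": "record", "name": "trace", "namespace": "asp", "fields": [
  { "name": "id", "type": "long" },
  { "name": "timestamp", "type": "long" },
  { "name": "latitude", "type": ["int","null"] },
  { "name": "longitude", "type": ["int","null"] },
  { "name": "speed", "type": ["int","null"] },
  { "name": "heading", "type": ["int","null"] },
  { "name": "terminalid", "type": "int" },
  { "name": "customerid", "type": "string" },
  { "name": "mileage", "type": ["int","null"] },
  { "name": "creationtime", "type": "long" },
  { "name": "tracetype", "type": "int" },
  { "name": "traceproperties", "type": { "type": "array", "items": {
      "name": "traceproperty", "type": "record", "fields": [
        { "name": "id", "type": "long" },
        { "name": "value", "type": "string" },
        { "name": "pkey", "type": "string" } ] } } }
] }
""")

# Assumption: only the primitive types that occur in this thread are covered.
PRIMITIVES = {"int": "int", "long": "long", "string": "chararray"}


def to_pig(avro_type):
    """Render one Avro type as a Pig-style type string (toy mapping)."""
    if isinstance(avro_type, list):
        # Union such as ["int", "null"]: keep the single non-null branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError("unsupported union: %r" % (avro_type,))
        return to_pig(non_null[0])
    if isinstance(avro_type, dict):
        kind = avro_type["type"]
        if kind == "record":
            fields = ",".join("%s: %s" % (f["name"], to_pig(f["type"]))
                              for f in avro_type["fields"])
            return "(%s)" % fields
        if kind == "array":
            return "{ARRAY_ELEM: %s}" % to_pig(avro_type["items"])
        return to_pig(kind)
    return PRIMITIVES[avro_type]


def describe(alias, schema):
    """Format a top-level record the way the describe output above looks."""
    body = to_pig(schema)              # "(id: long,...)"
    return "%s: {%s}" % (alias, body[1:-1])


if __name__ == "__main__":
    print(describe("avro", AVRO_SCHEMA))
```

Under these assumptions the sketch reproduces the describe line shown above for a single file, which supports the point that the per-file schema is fine and that the "Schema for avro unknown." message is tied to the glob load rather than to the schema itself.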
|