Re: LOAD multiple files with glob
Russell Jurney 2012-11-26, 18:23
Is the globbing feature making it into the AvroStorage rewrite?
Russell Jurney twitter.com/rjurney

On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:

> To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working very well now, globbing seems to be fully supported!
>
> Bart Verwilst wrote on 26.11.2012 15:33:
>> To answer myself, could this be part of the solution?
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>> Guess I'll have to wait for 0.11 then?
>>
>> Bart Verwilst wrote on 26.11.2012 14:19:
>>> 14:16:08 centos6-hadoop-hishiru ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/*' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:09 centos6-hadoop-hishiru ~ $ pig avro-test.pig
>>> Schema for avro unknown.
>>>
>>> 14:16:17 centos6-hadoop-hishiru ~ $ vim avro-test.pig
>>>
>>> 14:16:25 centos6-hadoop-hishiru ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:30 centos6-hadoop-hishiru ~ $ pig avro-test.pig
>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>> (id: long,value: chararray,pkey: chararray)}}
>>>
>>> 14:16:55 centos6-hadoop-hishiru ~ $ hadoop fs -ls /test/
>>> Found 1 items
>>> -rw-r--r-- 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro
>>>
>>> Cheolsoo Park wrote on 26.11.2012 10:45:
>>>> Hi,
>>>>
>>>>>> Invalid field projection. Projected field [tracetype] does not exist.
>>>>
>>>> The error indicates that the field "tracetype" doesn't exist in the Pig schema of
>>>> the relation "avro". What AvroStorage does is to automatically convert Avro
>>>> schema to Pig schema during the load. Although you have "tracetype" in your
>>>> Avro schema, "tracetype" doesn't exist in the generated Pig schema for
>>>> whatever reason.
>>>>
>>>> Can you please try to "describe avro"? You can replace group and dump
>>>> commands with describe in your Pig script. This will show you what the Pig
>>>> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
>>>> out why it doesn't. It could be because the schema of .avro files is not
>>>> the same or because there is a bug in AvroStorage, etc.
>>>>
>>>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>>>
>>>> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
>>>> path globbing doesn't support '[ ]'. But the above error (Projected field
>>>> [tracetype] does not exist) is not because of this. URISyntaxException is
>>>> what you will get because of '[ ]'.
>>>>
>>>> Thanks,
>>>> Cheolsoo
>>>>
>>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Just tried this:
>>>>>
>>>>> ----------------------------------------------------
>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>
>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>
>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>>
>>>>> groups = group avro by tracetype;
>>>>>
>>>>> dump groups;
>>>>> ----------------------------------------------------
>>>>>
>>>>> gave me:
>>>>>
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> Pig Stack Trace
>>>>> -------------