-
Reading Avro files with Pig
Bart Verwilst 2012-11-19, 15:59
Hi,
I'm trying to read the Avro file I stored on HDFS, but I seem to be hitting a snag. I'm hoping some of you will be able to shed some light on this and allow me to continue my adventure!

REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
groups = group avro by trace.terminalid;
sc = foreach groups generate group as terminalid, COUNT(avro) as cnt;
store sc into '/import/test-out.avro' USING AvroStorage();
The schema of the avro file:
{
  "type": "record",
  "name": "trace",
  "namespace": "asp",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "timestamp", "type": "long" },
    { "name": "terminalid", "type": "int" },
    { "name": "creationtime", "type": "long" },
    { "name": "tracetype", "type": "int" },
    { "name": "traceproperties", "type": {
        "type": "array",
        "items": {
          "name": "traceproperty",
          "type": "record",
          "fields": [
            { "name": "id", "type": "long" },
            { "name": "value", "type": "string" },
            { "name": "pkey", "type": "string" },
            { "name": "traceid", "type": "long" }
          ]
        }
      }
    }
  ]
}

The script above gives me:
<file avro-test.pig, line 9, column 28> Invalid field reference. Referenced field [terminalid] does not exist in schema: .
So I guess I'm missing the point on how to interface with the schema here?
Thanks in advance!
Kind regards,
Bart
-
Re: Reading Avro files with Pig
Cheolsoo Park 2012-11-19, 18:36
Hi Bart,
Please try to print out the schema of 'avro' using 'DESCRIBE avro'. This will show you the field names in the relation.
avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
DESCRIBE avro;
Given your description, I suppose that changing 'trace.terminalid' to 'avro.terminalid' will make your error go away.
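As a rough sketch of how the adjusted script might look: the version below keeps the jar paths and file names from the original post and adds the DESCRIBE step. It assumes, without confirmation in this thread, that DESCRIBE reports terminalid as a top-level field of 'avro'. The grouping key is written as the bare field name, which is the usual form for a GROUP BY key in Pig Latin; whether a relation-name prefix is accepted there is not verified here.

```pig
-- Jar paths and the input file are taken from the original script.
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = LOAD '/import/2012-01-04-deflate.avro' USING AvroStorage();

-- Print the relation's schema to confirm the actual field names.
DESCRIBE avro;

-- Assumption: DESCRIBE lists terminalid as a top-level field of 'avro'.
-- The Avro record name ("trace") is not used as a prefix here.
groups = GROUP avro BY terminalid;
sc = FOREACH groups GENERATE group AS terminalid, COUNT(avro) AS cnt;

STORE sc INTO '/import/test-out.avro' USING AvroStorage();
```

If DESCRIBE shows different field names, the GROUP BY key should be changed to match whatever it reports.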
Thanks, Cheolsoo