Bart Verwilst 2012-11-19, 15:59
Re: Reading Avro files with Pig
Hi Bart,

Please try to print out the schema of 'avro' using 'DESCRIBE avro'. This
will show you the field names in the relation.

avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
DESCRIBE avro;

Given your description, I suppose that changing 'trace.terminalid' to
'avro.terminalid' will make your error go away.
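
If 'avro.terminalid' is still rejected, another form worth trying is the bare
field name. Below is a minimal sketch, assuming DESCRIBE reports terminalid as
a top-level field of 'avro' (as the schema you posted suggests) and that
AvroStorage is registered and defined as in your script. The alias 'grouped'
is only illustrative.

```pig
-- Assumes the REGISTER and DEFINE statements from the original script come first.
avro = LOAD '/import/2012-01-04-deflate.avro' USING AvroStorage();
DESCRIBE avro;  -- check that terminalid is listed as a top-level field

-- 'trace' is the record name in the Avro schema, so it is presumably not a
-- field of the relation; group by the field name on its own instead.
grouped = GROUP avro BY terminalid;
sc = FOREACH grouped GENERATE group AS terminalid, COUNT(avro) AS cnt;

-- Print the counts to the console for a quick check before storing them.
DUMP sc;
```

DUMP is used here instead of STORE only to inspect the result quickly.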

Thanks,
Cheolsoo

On Mon, Nov 19, 2012 at 7:59 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to read the Avro file I stored on HDFS, but I seem to be
> hitting a snag. I'm hoping some of you will be able to shed some light on
> this and allow me to continue my adventure!
>
>
> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> REGISTER 'hdfs:///lib/piggybank.jar';
>
> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
>
> groups = group avro by trace.terminalid;
> sc = foreach groups generate group as terminalid, COUNT(avro) as cnt;
>
> store sc into '/import/test-out.avro' USING AvroStorage();
>
>
>
> The schema of the avro file:
>
> {
>     "type": "record",
>     "name": "trace",
>     "namespace": "asp",
>     "fields": [
>         {   "name": "id"   , "type": "long"   },
>         {   "name": "timestamp"    , "type": "long"      },
>         {   "name": "terminalid", "type": "int"   },
>         {   "name": "creationtime", "type": "long"   },
>         {   "name": "tracetype", "type": "int"   },
>         {   "name": "traceproperties", "type": {
>                 "type": "array",
>                 "items": {
>                     "name": "traceproperty",
>                     "type": "record",
>                     "fields": [
>                         {    "name": "id", "type": "long"    },
>                         {    "name": "value", "type": "string"    },
>                         {    "name": "pkey", "type": "string"    },
>                         {    "name": "traceid", "type": "long"    }
>                     ]
>                 }
>             }
>         }
>     ]
> }
>
>
> The script above gives me:
>
> <file avro-test.pig, line 9, column 28> Invalid field reference.
> Referenced field [terminalid] does not exist in schema: .
>
> So I guess I'm missing the point on how to interface with the schema here?
>
> Thanks in advance!
>
> Kind regards,
>
> Bart
>