Pig >> mail # user >> Reading Avro files with Pig


Re: Reading Avro files with Pig
Hi Bart,

Please try to print out the schema of 'avro' using 'DESCRIBE avro'. This
will show you the field names in the relation.

avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
DESCRIBE avro;

Given your description, I suppose that changing 'trace.terminalid' to
'avro.terminalid' will make your error go away.
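
For reference, here is a rough sketch of how the whole script might look once the field reference is fixed. It reuses the paths and jars from your mail unchanged. It assumes DESCRIBE lists 'terminalid' as a top-level field of the relation, in which case the bare field name should be enough as the grouping key; that is an assumption until you have seen the DESCRIBE output, so adjust the name to whatever it actually reports.

```pig
-- Sketch only: assumes DESCRIBE shows terminalid as a top-level field of 'avro'.
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = LOAD '/import/2012-01-04-deflate.avro' USING AvroStorage();
DESCRIBE avro;  -- check the field names before grouping

-- Group on the field name itself; 'trace' is the Avro record name,
-- not an alias known to Pig.
groups = GROUP avro BY terminalid;
sc = FOREACH groups GENERATE group AS terminalid, COUNT(avro) AS cnt;

STORE sc INTO '/import/test-out.avro' USING AvroStorage();
```

The only functional change from your script is the grouping key; everything else is as you posted it.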

Thanks,
Cheolsoo

On Mon, Nov 19, 2012 at 7:59 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to read the Avro file I stored on HDFS, but I seem to be
> hitting a snag. I'm hoping some of you will be able to shed some light on
> this and allow me to continue my adventure!
>
>
> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> REGISTER 'hdfs:///lib/piggybank.jar';
>
> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> avro = load '/import/2012-01-04-deflate.avro' USING AvroStorage();
>
> groups = group avro by trace.terminalid;
> sc = foreach groups generate group as terminalid, COUNT(avro) as cnt;
>
> store sc into '/import/test-out.avro' USING AvroStorage();
>
>
>
> The schema of the avro file:
>
> {
>     "type": "record",
>     "name": "trace",
>     "namespace": "asp",
>     "fields": [
>         {   "name": "id"   , "type": "long"   },
>         {   "name": "timestamp"    , "type": "long"      },
>         {   "name": "terminalid", "type": "int"   },
>         {   "name": "creationtime", "type": "long"   },
>         {   "name": "tracetype", "type": "int"   },
>         {   "name": "traceproperties", "type": {
>                 "type": "array",
>                 "items": {
>                     "name": "traceproperty",
>                     "type": "record",
>                     "fields": [
>                         {    "name": "id", "type": "long"    },
>                         {    "name": "value", "type": "string"    },
>                         {    "name": "pkey", "type": "string"    },
>                         {    "name": "traceid", "type": "long"    }
>                     ]
>                 }
>             }
>         }
>     ]
> }
>
>
> The script above gives me:
>
> <file avro-test.pig, line 9, column 28> Invalid field reference.
> Referenced field [terminalid] does not exist in schema: .
>
> So I guess I'm missing the point on how to interface with the schema here?
>
> Thanks in advance!
>
> Kind regards,
>
> Bart
>