

Milind Vaidya 2013-01-11, 16:12
Re: Pig AvroStorage : storing the data
Hi,

Here is a working version of your example.
1) AvroStorage Load -> AvroStorage Store -> AvroStorage Load

-----
REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER contrib/piggybank/java/piggybank.jar

DEFINE AVRO_LOAD_1 org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
DEFINE AVRO_LOAD_2 org.apache.pig.piggybank.storage.avro.AvroStorage();
DEFINE AVRO_STORE org.apache.pig.piggybank.storage.avro.AvroStorage('same', 'AvroData/employee.avro');

employee = LOAD 'AvroData' USING AVRO_LOAD_1;
DUMP employee;

STORE employee INTO 'StoredAvro' USING AVRO_STORE;

employee = LOAD 'StoredAvro' USING AVRO_LOAD_2;
DUMP employee;
-----

Please note that:
* The AvroStorage store definition (AVRO_STORE) specifies the schema with the
'same' option. This means the relation 'employee' is stored using the same
schema as 'AvroData/employee.avro'. Alternatively, you can specify the schema
as a JSON string with the 'schema' option. For example, AvroStorage('schema',
'<JSON string>').
* I moved StoredAvro out of AvroData. This is because AvroStorage loads
directories recursively. If StoredAvro were inside AvroData and I ran this
script multiple times, I would load not only the files in AvroData but also
those in AvroData/StoredAvro from a previous run. Therefore, I am using
separate directories for input and output.
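
As a rough sketch of the 'schema' option mentioned above, the store function
could be defined with an inline JSON schema instead of 'same'. The alias
AVRO_STORE_JSON, the output directory StoredAvroJson, and the schema itself are
assumptions for illustration: the field names and types are guessed from the
PigStorage column list in example 2, not read from the real employee.avro, so
adjust them to match your data.

```pig
REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER contrib/piggybank/java/piggybank.jar

DEFINE AVRO_LOAD org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');

-- Hypothetical schema: field names and types are assumed, not taken from
-- AvroData/employee.avro. The JSON is kept on one line inside the quotes.
DEFINE AVRO_STORE_JSON org.apache.pig.piggybank.storage.avro.AvroStorage('schema',
'{"type":"record","name":"employee","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"},{"name":"dept","type":"string"},{"name":"office","type":"string"},{"name":"salary","type":"int"},{"name":"lastname","type":"string"}]}');

employee = LOAD 'AvroData' USING AVRO_LOAD;

-- Write to a directory outside AvroData, as in example 1.
STORE employee INTO 'StoredAvroJson' USING AVRO_STORE_JSON;
```

If some fields can be null in your data, the plain types above will probably
need to become unions such as ["null","string"].
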
2) AvroStorage Load -> PigStorage Store -> PigStorage Load

-----
DEFINE AVRO_LOAD org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');

employee = LOAD 'AvroData' USING AVRO_LOAD;
DUMP employee;

STORE employee INTO 'StoredText' USING PigStorage(',');

employee = LOAD 'StoredText' USING PigStorage(',') as (name:chararray,
age:int, dept:chararray, office:chararray, salary:int, lastname:chararray);
DUMP employee;
-----
3) Regarding your errors:

* ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve
AvroStorage using imports: [, org.apache.pig.builtin.,
org.apache.pig.impl.builtin.]
This is because you didn't use the fully qualified name of AvroStorage in your
script. When a function name has no package qualifier, Pig tries only the
default import prefixes listed in the error message.
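
For illustration, here is a minimal sketch of the two ways to give Pig the
fully qualified name, using the same jars and the AvroData directory from the
examples above. The commented-out line shows the kind of unqualified call that
would lead to ERROR 1070.

```pig
REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER contrib/piggybank/java/piggybank.jar

-- Unqualified name: Pig looks only under its default import prefixes,
-- cannot find the piggybank class, and reports ERROR 1070.
-- employee = LOAD 'AvroData' USING AvroStorage();

-- Option 1: spell out the full class name in the USING clause.
employee = LOAD 'AvroData' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Option 2: bind the full class name to an alias once, then reuse the alias.
DEFINE AVRO_LOAD org.apache.pig.piggybank.storage.avro.AvroStorage();
employee2 = LOAD 'AvroData' USING AVRO_LOAD;
```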

* ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema
from loadFunc org.apache.pig.piggybank.storage.avro.AvroStorage
This can happen when you load non-Avro files (e.g. text files) using
AvroStorage. For example, if you store data using AvroStorage() without a
schema, it will be stored as a text file. If you then load it again using
AvroStorage, you will get this error. It's hard to tell exactly how you ran
into this situation, but given that you're writing files into a
sub-directory of the input directory, you probably loaded text files stored
by a previous run. This is why I recommend that you separate the input and
output directories.
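
One way to guard against this from inside the script, as a sketch only: list
the input directory before loading, and clear the output directory left by an
earlier run. It assumes the same AvroData and StoredAvro directories and the
same jars as example 1; fs and rmf are Pig's own shell commands.

```pig
REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER contrib/piggybank/java/piggybank.jar

DEFINE AVRO_LOAD org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
DEFINE AVRO_STORE org.apache.pig.piggybank.storage.avro.AvroStorage('same', 'AvroData/employee.avro');

-- Inspect the input directory: it should contain only Avro files and no
-- output sub-directory from a previous run.
fs -ls AvroData

-- Remove old output; rmf does not complain if the path does not exist.
rmf StoredAvro

employee = LOAD 'AvroData' USING AVRO_LOAD;
STORE employee INTO 'StoredAvro' USING AVRO_STORE;
```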

Thanks,
Cheolsoo
Milind Vaidya 2013-01-11, 21:02
Russell Jurney 2013-01-11, 21:04