Re: Pig AvroStorage : storing the data
Cheolsoo Park 2013-01-11, 19:33
Hi,
Here is a working version of your example.

1) AvroStorage Load -> AvroStorage Store -> AvroStorage Load

-----
REGISTER build/ivy/lib/Pig/avro-1.7.1.jar
REGISTER build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER contrib/piggybank/java/piggybank.jar

DEFINE AVRO_LOAD_1 org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
DEFINE AVRO_LOAD_2 org.apache.pig.piggybank.storage.avro.AvroStorage();
DEFINE AVRO_STORE org.apache.pig.piggybank.storage.avro.AvroStorage('same', 'AvroData/employee.avro');

-- Load the original Avro files.
employee = LOAD 'AvroData' USING AVRO_LOAD_1;
DUMP employee;

-- Store with the schema of AvroData/employee.avro, outside the input directory.
STORE employee INTO 'StoredAvro' USING AVRO_STORE;

-- Load the stored files back.
employee = LOAD 'StoredAvro' USING AVRO_LOAD_2;
DUMP employee;
-----

Please note that:

* The Avro store command (AVRO_STORE) defines the schema with the 'same' option. This means it stores the relation 'employee' using the same schema as 'AvroData/employee.avro'. Alternatively, you can specify the schema as a JSON string with the 'schema' option. For example, AvroStorage('schema', '<JSON string>').

* I moved StoredAvro out of AvroData. This is because AvroStorage loads directories recursively. If I ran this script multiple times with the output under the input directory, I would load not only the files in AvroData but also the files in AvroData/StoredAvro from a previous run. Therefore, I am using separate directories for input and output.

2) AvroStorage Load -> PigStorage Store -> PigStorage Load

-----
-- Uses the same REGISTER statements as in 1).
DEFINE AVRO_LOAD org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');

employee = LOAD 'AvroData' USING AVRO_LOAD;
DUMP employee;

-- Store as comma-separated text.
STORE employee INTO 'StoredText' USING PigStorage(',');

-- Text files carry no schema, so declare the field names and types on load.
employee = LOAD 'StoredText' USING PigStorage(',') as (name:chararray, age:int, dept:chararray, office:chararray, salary:int, lastname:chararray);
DUMP employee;
-----

3) Regarding your errors:

* ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve AvroStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]

  This is because you didn't use the fully qualified name of AvroStorage in your script. When no qualifier is given, Pig tries only the default package prefixes listed in the error message, and AvroStorage is in none of them.

* ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.pig.piggybank.storage.avro.AvroStorage

  This can happen when you load non-Avro files (e.g. text files) using AvroStorage. For example, if you store data using AvroStorage() without a schema, it will be stored as a text file. Then, if you load it again using AvroStorage, you will get this error. It's hard to tell exactly how you ran into this situation, but given that you're writing files into a sub-directory of the input directory, you probably loaded text files stored by a previous run. This is why I recommend that you separate the input and output directories.

Thanks,
Cheolsoo