-
AvroStorage and duplicate records
Alex Holmes 2011-09-22, 00:01
Hi all,
I have a simple schema
{"name": "Record", "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"}
  ]
}
which I use to write 2 records to an Avro file with the following code:
public static Record createRecord(String name, int id) {
  Record record = new Record();
  record.name = name;
  record.id = id;
  return record;
}

public static void writeToAvro(OutputStream outputStream)
    throws IOException {
  DataFileWriter<Record> writer =
      new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
  writer.create(Record.SCHEMA$, outputStream);

  writer.append(createRecord("r1", 1));
  writer.append(createRecord("r2", 2));

  writer.close();
  outputStream.close();
}
I also have some reader code which reads in the file and just dumps the contents of each Record:
DataFileStream<Record> reader = new DataFileStream<Record>(
    is, new SpecificDatumReader<Record>(Record.SCHEMA$));
for (Record a : reader) {
  System.out.println(ToStringBuilder.reflectionToString(a));
}
Its output is:
Record@1e9e5c73[name=r1,id=1]
Record@ed42d08[name=r2,id=2]
When using this file with Pig and AvroStorage, Pig seems to think there are 4 records:
grunt> REGISTER /app/hadoop/lib/avro-1.5.4.jar;
grunt> REGISTER /app/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/json-simple-1.1.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-core-asl-1.6.0.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-mapper-asl-1.6.0.jar;
grunt> raw = LOAD 'test.v1.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage;
grunt> dump raw;
..
Input(s):
Successfully read 4 records (825 bytes) from: "hdfs://localhost:9000/user/aholmes/test.v1.avro"

Output(s):
Successfully stored 4 records (46 bytes) in: "hdfs://localhost:9000/tmp/temp2039109003/tmp1924774585"

Counters:
Total records written : 4
Total bytes written : 46
..
(r1,1)
(r2,2)
(r1,1)
(r2,2)
I'm sure I'm doing something wrong, but would appreciate any help.
Many thanks,
Alex
-
Re: AvroStorage and duplicate records
Alex Holmes 2011-09-26, 10:05
Hi,
Anyone have any idea what could be up? Is the Avro support in Pig not ready for prime time yet?
Thanks,
Alex
-
Re: AvroStorage and duplicate records
Thejas Nair 2011-09-27, 05:43
I am not an Avro expert, but one thing to check is whether the Avro version used to write the data is compatible with the version Pig uses to read it.

-Thejas
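One way to narrow this down without involving Pig or the Avro library is to count the records directly from the container file's block headers. The sketch below is not from the thread: `AvroBlockCounter` and its methods are hypothetical names, and it assumes the file follows the Avro object container layout (the 4-byte magic `Obj` 0x01, a metadata map, a 16-byte sync marker, then data blocks that each start with an object count and a byte size). It only sums the per-block counts and never decodes a record.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical diagnostic helper; not part of Avro or Pig.
final class AvroBlockCounter {

    // Decodes one zigzag variable-length long, given its first byte.
    static long readLong(int first, InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            b = in.read();
        }
        return (raw >>> 1) ^ -(raw & 1);
    }

    static long readLong(InputStream in) throws IOException {
        return readLong(in.read(), in);
    }

    // Consumes exactly n bytes, failing with EOFException on a short file.
    static void skip(DataInputStream in, long n) throws IOException {
        byte[] buf = new byte[8192];
        while (n > 0) {
            int chunk = (int) Math.min(buf.length, n);
            in.readFully(buf, 0, chunk);
            n -= chunk;
        }
    }

    // Sums the object counts declared in each data block header.
    static long countRecords(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro container file");
        }
        // Metadata is an Avro map: blocks of key/value pairs, ended by a 0 count.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) {          // a negative count is followed by a byte size
                readLong(in);
                n = -n;
            }
            for (long i = 0; i < n; i++) {
                skip(in, readLong(in)); // key (string)
                skip(in, readLong(in)); // value (bytes)
            }
        }
        skip(in, 16);             // header sync marker

        long total = 0;
        int first;
        while ((first = in.read()) >= 0) {   // clean EOF between blocks ends the loop
            total += readLong(first, in);    // number of objects in this block
            skip(in, readLong(in));          // serialized objects
            skip(in, 16);                    // block sync marker
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(countRecords(in));
        }
    }
}
```

Run against a local copy of the file (for example, one fetched with `hadoop fs -get`), this gives a count that depends on neither reader. If it agrees with the Java reader's 2 records, the file itself is fine and the duplication is happening on the Pig side.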