Confusion re. persisting the schema
Christopher Hunt 2010-10-12, 05:58
Hi there,
I've just noticed that when I write out my binary data I don't appear to have a schema saved with it. I was under the impression that Avro saves schemas along with the data. Thanks for any clarification.
Here's my schema:
    {
      "name": "FileDependency",
      "type": "record",
      "fields": [
        {"name": "file", "type": "string"},
        {"name": "imports", "type": {"type": "array", "items": "string"}}
      ]
    }
The code to write out my data is as follows (also appreciate any refinement suggestions as I'm new to Avro):
    @Cleanup
    InputStream fileDependencySchemaIs = this.getClass()
            .getResourceAsStream(FILE_DEPENDENCY_GRAPH_SCHEMA_NAME);
    Schema fileDependencySchema = Schema.parse(fileDependencySchemaIs);

    GenericDatumWriter<GenericRecord> genericDatumWriter =
            new GenericDatumWriter<GenericRecord>(fileDependencySchema);
    @Cleanup
    OutputStream os = new FileOutputStream(
            new File(workFolder, FILE_DEPENDENCY_GRAPH_NAME));
    Encoder encoder = new BinaryEncoder(os);
    for (Map.Entry<String, Set<String>> entry : fileDependencies.entrySet()) {
        GenericRecord genericRecord =
                new GenericData.Record(fileDependencySchema);

        genericRecord.put("file", new Utf8(entry.getKey()));

        Set<String> imports = entry.getValue();
        GenericArray<Utf8> genericArray = new GenericData.Array<Utf8>(
                imports.size(),
                Schema.createArray(Schema.create(Type.STRING)));
        for (String importFile : imports) {
            genericArray.add(new Utf8(importFile));
        }
        genericRecord.put("imports", genericArray);

        genericDatumWriter.write(genericRecord, encoder);
    }
    encoder.flush();
Thanks again.
Kind regards, Christopher
Re: Confusion re. persisting the schema
Harsh J 2010-10-12, 06:04
You are simply writing encoded data with that code. To write proper Avro data files, use org.apache.avro.file.DataFileWriter and append each datum to it; it stores the schema in the file header, among other features.
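To make the distinction above concrete, here is a toy sketch in plain Java. It deliberately does not use Avro, and it does not reproduce Avro's real binary encoding or data file layout; the class SchemaHeaderSketch and all of its methods are hypothetical names made up for this illustration. It only shows the general idea: encoding a datum by itself writes field values and nothing else, whereas a file format that starts with a header can carry the schema text alongside the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: NOT Avro's encoding and NOT Avro's file format.
class SchemaHeaderSketch {

    // The schema from the original question, kept as plain JSON text.
    static final String SCHEMA_JSON =
            "{\"name\": \"FileDependency\", \"type\": \"record\", \"fields\": ["
            + "{\"name\": \"file\", \"type\": \"string\"}, "
            + "{\"name\": \"imports\", \"type\": {\"type\": \"array\", \"items\": \"string\"}}]}";

    // Writes only the field values: one string, then a counted list of strings.
    private static void writeDatum(DataOutputStream out, String file,
                                   List<String> imports) throws IOException {
        out.writeUTF(file);
        out.writeInt(imports.size());
        for (String importFile : imports) {
            out.writeUTF(importFile);
        }
    }

    // Analogous to encoding a datum directly: values only, no schema.
    public static byte[] rawEncoding(String file, List<String> imports) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeDatum(out, file, imports);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Analogous to a data file: a header holding the schema, then the datum.
    public static byte[] withSchemaHeader(String file, List<String> imports) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(SCHEMA_JSON);
            writeDatum(out, file, imports);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Recovers the schema text from the header written by withSchemaHeader.
    public static String readSchemaHeader(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the ASCII text occurs anywhere in the byte array.
    public static boolean contains(byte[] data, String text) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(text);
    }

    public static void main(String[] args) {
        List<String> imports = Arrays.asList("B.java", "C.java");
        byte[] raw = rawEncoding("A.java", imports);
        byte[] framed = withSchemaHeader("A.java", imports);
        System.out.println("raw bytes mention the schema:    " + contains(raw, "FileDependency"));
        System.out.println("framed bytes mention the schema: " + contains(framed, "FileDependency"));
        System.out.println("schema recovered from header:    " + readSchemaHeader(framed).equals(SCHEMA_JSON));
    }
}
```

Running main prints false, true, true: the raw bytes contain the values but nothing that names the record or its fields, while the framed bytes let a reader recover the schema before decoding anything. In real code the header is handled for you by DataFileWriter, as described above.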
Re: Confusion re. persisting the schema
Christopher Hunt 2010-10-12, 11:02
Following your pointer I worked it out. For the benefit of others:
    @Cleanup
    InputStream fileDependencySchemaIs = this.getClass()
            .getResourceAsStream(FILE_DEPENDENCY_GRAPH_SCHEMA_NAME);
    Schema fileDependencySchema = Schema.parse(fileDependencySchemaIs);

    GenericDatumWriter<GenericRecord> genericDatumWriter =
            new GenericDatumWriter<GenericRecord>(fileDependencySchema);
    @Cleanup
    DataFileWriter<GenericRecord> dataFileWriter =
            new DataFileWriter<GenericRecord>(genericDatumWriter);
    dataFileWriter.create(fileDependencySchema,
            new File(workFolder, FILE_DEPENDENCY_GRAPH_NAME));

    for (Map.Entry<String, Set<String>> entry : fileDependencies.entrySet()) {
        GenericRecord genericRecord =
                new GenericData.Record(fileDependencySchema);

        genericRecord.put("file", new Utf8(entry.getKey()));

        Set<String> imports = entry.getValue();
        GenericArray<Utf8> genericArray = new GenericData.Array<Utf8>(
                imports.size(),
                Schema.createArray(Schema.create(Type.STRING)));
        for (String importFile : imports) {
            genericArray.add(new Utf8(importFile));
        }
        genericRecord.put("imports", genericArray);

        dataFileWriter.append(genericRecord);
    }
    dataFileWriter.flush();
All is now well. I can similarly read in:
    @Cleanup
    InputStream fileDependencySchemaIs = this.getClass()
            .getResourceAsStream(FILE_DEPENDENCY_GRAPH_SCHEMA_NAME);
    Schema fileDependencySchema = Schema.parse(fileDependencySchemaIs);

    GenericDatumReader<GenericRecord> genericDatumReader =
            new GenericDatumReader<GenericRecord>(fileDependencySchema);

    File file = new File(workFolder, FILE_DEPENDENCY_GRAPH_NAME);

    @Cleanup
    DataFileReader<GenericRecord> dataFileReader =
            new DataFileReader<GenericRecord>(file, genericDatumReader);

    // Reuse one record instance across iterations.
    GenericRecord genericRecord =
            new GenericData.Record(fileDependencySchema);
    // Loop while records remain in the file.
    while (dataFileReader.hasNext()) {
        genericRecord = dataFileReader.next(genericRecord);

        String recordFile = ((Utf8) genericRecord.get("file")).toString();

        // Convert the Avro array of Utf8 values back into a Set<String>.
        GenericData.Array<?> recordImportObjects =
                (GenericData.Array<?>) genericRecord.get("imports");
        Set<String> imports = new HashSet<String>();
        for (Object recordImportObject : recordImportObjects) {
            imports.add(((Utf8) recordImportObject).toString());
        }
        fileDependencies.put(recordFile, imports);
    }
Thanks.
Kind regards, Christopher