Avro >> mail # user >> Schema not getting saved along with Data


Re: Schema not getting saved along with Data
On 1 Apr 2014, at 11:12, Lewis John Mcgibbney <[EMAIL PROTECTED]> wrote:
Right now we maintain only the Writer's schema, which, as I mentioned, is appended within the generated Persistent Java bean. In my own experience (and as you've hinted at :) ) this has caused us problems in the past.
For example, we added a new (pretty innocent) string Field 'batchId' to our WebPage Schema [0] over in Nutch, meaning that new Records being written included it and older Records already within the data set did not.
{"name": "batchId", "type": "string"}
This inevitably threw an NPE when certain Tools attempted to access records in which the batchId Field and its value were absent.

I have seen several people get confused about this before -- you're not alone. I actually think the fact that you have two different schemas when reading is the thing that most confuses people who are new to Avro. It's so different from what most people are used to.
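
To make the two-schema idea concrete, here is a rough sketch in plain Python. It does not use the Avro library; it only imitates how a record written with an older schema gets projected onto a newer reader schema. The `url` field, the `resolve` helper and the empty-string default for `batchId` are all made up for illustration (the field definition quoted above has no default).

```python
import copy

# Simplified stand-ins for Avro record schemas (hypothetical, not the real
# Nutch WebPage schema). WRITER_V1 is what older records were written with;
# READER_V2 adds batchId with a default value (an assumption for this sketch).
WRITER_V1 = {"fields": [{"name": "url", "type": "string"}]}
READER_V2 = {
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "batchId", "type": "string", "default": ""},
    ]
}


def resolve(record, writer_schema, reader_schema):
    """Project a record decoded with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in the data as written: take the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = copy.deepcopy(field["default"])
        else:
            # No stored value and no default: fail loudly at read time
            # instead of handing the caller a null.
            raise ValueError(f"no value or default for field {name!r}")
    # Writer fields the reader does not know about are simply dropped.
    return resolved


old_record = {"url": "http://example.com/"}
print(resolve(old_record, WRITER_V1, READER_V2))
```

The point is that the reader needs to know both schemas: without the writer's schema it cannot tell that batchId was never written, and without a default in its own schema it has nothing to put in its place.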

So taking a bit of advice from a well-recognized voice in this area (uh hum ;))

Haha ;)

For those following along on the mailing list, Lewis quoted from my blog post: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Fortunately, in the above example, this particular Schema has changed only once in some 2 or 3 years. However, it HAS changed.

It's probably safe to assume that every schema will have to change sooner or later.

Looks like I am also taking a lesson from this thread and we have a bit more work to do on Gora to address the above points. This is of course unless I have missed something!

A proposal to create a registry of Avro schemas has been a long time coming (https://issues.apache.org/jira/browse/AVRO-1124). This would allow you to include a small version number or hash of the schema in each record, to indicate the writer schema that was used to encode it. That would be much lower overhead than including the entire schema with every record.
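
As a rough illustration of the hash-per-record idea, here is a minimal sketch in plain Python. It is not the AVRO-1124 design; `SchemaRegistry`, `frame`, `unframe` and the 8-byte truncated SHA-256 are all hypothetical choices. It also hashes sorted-key JSON rather than Avro's own Parsing Canonical Form, so the fingerprints would not match ones computed by Avro itself.

```python
import hashlib
import json

FINGERPRINT_SIZE = 8  # bytes of overhead per record (assumed size)


class SchemaRegistry:
    """In-memory map from schema fingerprint to schema (hypothetical)."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema):
        # Serialize with sorted keys so the same schema always hashes the same.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        fingerprint = digest[:FINGERPRINT_SIZE]
        self._schemas[fingerprint] = schema
        return fingerprint

    def lookup(self, fingerprint):
        return self._schemas[fingerprint]


def frame(fingerprint, payload):
    """Prefix the encoded record with its writer schema's fingerprint."""
    return fingerprint + payload


def unframe(registry, stored):
    """Split a stored record into (writer schema, encoded payload)."""
    fingerprint = stored[:FINGERPRINT_SIZE]
    payload = stored[FINGERPRINT_SIZE:]
    return registry.lookup(fingerprint), payload


registry = SchemaRegistry()
webpage_v2 = {
    "type": "record",
    "name": "WebPage",
    "fields": [{"name": "batchId", "type": "string"}],
}
fp = registry.register(webpage_v2)
stored = frame(fp, b"encoded-record-bytes")
writer_schema, payload = unframe(registry, stored)
print(writer_schema["name"], len(stored) - len(payload))
```

Each record then carries a fixed few bytes instead of the full schema JSON, and the reader can always recover the exact writer schema before decoding.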

As Gora is itself a database access layer, you can probably store the schemas in the same database as the records. If you go ahead and implement this, it would be great if you could keep compatibility with the AVRO-1124 schema registry in mind.

If Gora can hide the writer/reader schema distinction from users, and just do the right thing with schema evolution, that would be awesome!

Martin