I noticed in AVRO-1006 there was a mention of standardizing on some kind of
schema repository that would maintain a central set of all versions of a
schema and allow a way to reference schemas by id.
At LinkedIn we have standardized (almost) all of our persistent data on
Avro, and we have a repository like this for managing schemas. In Hadoop,
messages are stored together with their schema, but for systems that store
rows independently, such as databases or messaging systems, we instead
store a schema id with each row/message. We would love for there to be an
open source version of this, since that would make it possible to open up
our other tools for compatibility checking, ETL, and other things that
depend on this service.
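
To make the "schema id with each row/message" idea concrete, here is a
minimal sketch in Python. The framing is an assumption for illustration
only (a 4-byte big-endian id in front of the Avro-encoded bytes), and the
names frame_message and unframe_message are hypothetical; this is not a
description of the actual LinkedIn format.

```python
import struct
from typing import Tuple

# Assumed framing: 4-byte big-endian unsigned schema id, then the payload.
_HEADER = struct.Struct(">I")


def frame_message(schema_id: int, payload: bytes) -> bytes:
    """Prefix an already-encoded payload with the id of its writer schema."""
    return _HEADER.pack(schema_id) + payload


def unframe_message(framed: bytes) -> Tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    if len(framed) < _HEADER.size:
        raise ValueError("message too short to contain a schema id")
    (schema_id,) = _HEADER.unpack_from(framed)
    return schema_id, framed[_HEADER.size:]
```

A reader would call unframe_message, fetch the schema for the returned id
from the repository, and then decode the payload with it.
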
The service itself is basically a REST service that maintains schemas. Each
schema has a "source" that it is associated with (the table or messaging
topic or whatever) and a unique id. Schemas can be fetched by id or you can
get the latest schema for a given source. Having the notion of sources
allows us to do two things: (1) enforce a compatibility model on schema
changes (no backwards-incompatible changes, for various definitions of
backwards compatibility), and (2) allow our Hadoop ETL to project all
messages forward to the latest schema (since AvroFile requires a single
schema, not a per-row schema).
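
As a rough illustration of the model described above, here is an
in-memory sketch in Python. It is not our implementation: the class and
method names (SchemaRepository, register, get_by_id, get_latest) are
hypothetical, the REST layer is omitted, and the compatibility rule
(no_fields_removed) is a toy stand-in for whatever definition of
backwards compatibility a source actually enforces.

```python
import json
from typing import Callable, Dict, List

# (latest schema, proposed schema) -> True if the change is allowed.
CompatibilityCheck = Callable[[dict, dict], bool]


def no_fields_removed(old: dict, new: dict) -> bool:
    """Toy rule: every field in the old record schema must still exist."""
    old_names = {f["name"] for f in old.get("fields", [])}
    new_names = {f["name"] for f in new.get("fields", [])}
    return old_names <= new_names


class SchemaRepository:
    def __init__(self, is_compatible: CompatibilityCheck = no_fields_removed):
        self._is_compatible = is_compatible
        self._by_id: Dict[int, dict] = {}
        self._by_source: Dict[str, List[int]] = {}  # ids in registration order
        self._next_id = 1

    def register(self, source: str, schema_json: str) -> int:
        """Add a new schema version for a source and return its unique id."""
        schema = json.loads(schema_json)
        ids = self._by_source.setdefault(source, [])
        if ids and not self._is_compatible(self._by_id[ids[-1]], schema):
            raise ValueError("incompatible schema change for source %r" % source)
        schema_id = self._next_id
        self._next_id += 1
        self._by_id[schema_id] = schema
        ids.append(schema_id)
        return schema_id

    def get_by_id(self, schema_id: int) -> dict:
        return self._by_id[schema_id]

    def get_latest(self, source: str) -> dict:
        return self._by_id[self._by_source[source][-1]]
```

An ETL job in this sketch would look up each message's writer schema with
get_by_id and use get_latest(source) as the reader schema, so that all
output for a source ends up in a single schema.
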
If the Avro project is interested in adopting an official repository, that
would be really nice. It is frankly a pretty trivial piece of code, but
standardization would allow interoperability between tools. I would be
willing to either open source our repository implementation or do a
from-scratch one if we come up with more requirements.
Doug Cutting 2012-07-10, 18:25
Scott Carey 2012-07-10, 21:54
Jay Kreps 2012-07-11, 00:57
Scott Carey 2012-07-10, 20:37
Doug Cutting 2012-07-10, 20:40