Avro, mail # dev - schema repositories?
Re: schema repositories?
Doug Cutting 2012-07-10, 18:25
Jay,

This sounds to me like something of general utility that would make a
great addition to Avro.

To be clear, I assume you mean contributing this as source code for a
service that folks can deploy, right?  For example, it might be a Java
project that builds a WAR file that, when deployed, presents a REST
front end and talks to a backing persistence layer where the schemas
are stored.  Is that right?

Also note that Avro recently added a standard facility for defining
Schema fingerprints that might be used as Schema IDs in such a
service:

http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

This has currently been implemented in Java and C#:

http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html
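
As a rough illustration of how such a fingerprint could serve as a schema ID, the sketch below reimplements the 64-bit CRC-64-AVRO fingerprint along the lines of the sample code in the Avro specification, using only the Java standard library. The class name SchemaFingerprint and the idFor helper are hypothetical, and the input is assumed to already be in Parsing Canonical Form. In the Avro Java library, SchemaNormalization.toParsingForm and SchemaNormalization.parsingFingerprint64 should cover the same ground.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Avro.
class SchemaFingerprint {
    // Seed constant for CRC-64-AVRO as given in the Avro specification.
    static final long EMPTY = 0xc15d213aa4d7a795L;
    private static final long[] FP_TABLE = buildTable();

    // Precompute the 256-entry lookup table used by the byte-wise loop.
    private static long[] buildTable() {
        long[] table = new long[256];
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++) {
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            }
            table[i] = fp;
        }
        return table;
    }

    // 64-bit fingerprint of a byte buffer.
    static long fingerprint64(byte[] buf) {
        long fp = EMPTY;
        for (byte b : buf) {
            fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ b) & 0xff];
        }
        return fp;
    }

    // Assumption: the caller passes schema JSON that is already in
    // Parsing Canonical Form. Returns a 16-character hex string usable as an ID.
    static String idFor(String canonicalSchemaJson) {
        byte[] utf8 = canonicalSchemaJson.getBytes(StandardCharsets.UTF_8);
        return String.format("%016x", fingerprint64(utf8));
    }
}
```

A repository could then key its storage on SchemaFingerprint.idFor(canonicalJson), so that two clients registering the same canonical schema independently arrive at the same ID.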

I like the notion of a Schema source for the uses you describe.  For
records might this simply be the fully-qualified record name?  For
unions and other unnamed types it might take the same form as a record
name.  We could use the "long form" for primitives and always supply a
name so that a string schema might be {"type":"string",
"name":"org.foo.Bar"}.  Would this work, or is there some other
structure and use of sources for which schema names are not a good
match?

Cheers,

Doug

On Tue, Jul 10, 2012 at 10:53 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> I noticed in AVRO-1006 there was a mention of standardizing on some kind of
> schema repository that would maintain a central set of all versions of a
> schema and allow a way to reference schemas by id.
>
> At LinkedIn we have standardized (almost) all of our persistent data on
> Avro and we have a repository like this for managing schemas. Messages are
> stored with the schema in Hadoop, but for systems that store rows
> independently like databases or messaging we instead store a schema id with
> each row/message. We would love for there to be an open source version of
> this to make it possible to open up our other tools
> for compatibility checking, ETL, and other things that depend on the service.
>
> The service itself is basically a REST service that maintains schemas. Each
> schema has a "source" that it is associated with (the table or messaging
> topic or whatever) and a unique id. Schemas can be fetched by id or you can
> get the latest schema for a given source. Having the notion of sources
> allows us to do two things: (1) enforce a compatibility model on schema
> changes (no backwards incompatible changes for various definitions of
> backwards compatibility), and (2) allow our Hadoop ETL to project all
> messages forward to the latest schema (since AvroFile requires a single
> schema not a per-row schema).
>
> If the Avro project is interested in adopting an official repository that
> would be really nice. It is frankly a pretty trivial piece of code, but
> standardization would allow interoperability between things. I would be
> willing to either open source our repository implementation or do a
> from-scratch one if we come up with more requirements.
>
> -Jay
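
To make the service Jay describes more concrete, here is a minimal in-memory sketch of its core operations: register a schema under a source, fetch a schema by id, and fetch the latest schema for a source, with a pluggable compatibility check applied on registration. This is not LinkedIn's implementation. The class name, method names, sequential ids, and the BiPredicate-based compatibility hook are all assumptions made for illustration, and a real service would sit behind a REST front end with a persistent store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiPredicate;

// Hypothetical in-memory stand-in for a schema repository.
class InMemorySchemaRepository {
    private final Map<Long, String> schemaById = new HashMap<>();
    private final Map<String, List<Long>> idsBySource = new HashMap<>();
    // Assumed hook: test(latestSchema, candidateSchema) returns true when
    // the candidate is an acceptable successor of the latest schema.
    private final BiPredicate<String, String> compatible;
    private long nextId = 1;

    InMemorySchemaRepository(BiPredicate<String, String> compatible) {
        this.compatible = compatible;
    }

    // Registers a schema under a source and returns its id. Re-registering
    // the current latest schema returns the existing id.
    synchronized long register(String source, String schemaJson) {
        List<Long> versions = idsBySource.computeIfAbsent(source, s -> new ArrayList<>());
        if (!versions.isEmpty()) {
            long latestId = versions.get(versions.size() - 1);
            String latest = schemaById.get(latestId);
            if (latest.equals(schemaJson)) {
                return latestId;
            }
            if (!compatible.test(latest, schemaJson)) {
                throw new IllegalArgumentException(
                        "Incompatible schema change for source " + source);
            }
        }
        long id = nextId++;
        schemaById.put(id, schemaJson);
        versions.add(id);
        return id;
    }

    // Fetch a schema by its id.
    synchronized Optional<String> getById(long id) {
        return Optional.ofNullable(schemaById.get(id));
    }

    // Fetch the most recently registered schema for a source.
    synchronized Optional<String> getLatest(String source) {
        List<Long> versions = idsBySource.get(source);
        if (versions == null || versions.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(schemaById.get(versions.get(versions.size() - 1)));
    }
}
```

Keeping the compatibility rule as a constructor argument reflects the "various definitions of backwards compatibility" mentioned above: each deployment could supply its own rule without changing the repository itself.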