Avro >> mail # dev >> schema repositories?


Re: schema repositories?
Jay,

This sounds to me like something of general utility that would make a
great addition to Avro.

To be clear, I assume you mean contributing this as source code for a
service that folks can deploy, right?  For example, it might be a Java
project that builds a WAR file that, when deployed, presents a REST
front end and talks to a backing persistence layer where the schemas
are stored.  Is that right?

Also note that Avro recently added a standard facility for defining
Schema fingerprints that might be used as Schema IDs in such a
service:

http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

This has currently been implemented in Java and C#:

http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html

I like the notion of a Schema source for the uses you describe.  For
records might this simply be the fully-qualified record name?  For
unions and other unnamed types it might take the same form as a record
name.  We could use the "long form" for primitives and always supply a
name so that a string schema might be {"type":"string",
"name":"org.foo.Bar"}.  Would this work, or is there some other
structure and use of sources for which schema names are not a good
match?
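
As a rough illustration of the naming idea above, the following Python sketch derives a source key from a schema's JSON by taking its fully-qualified name. The helper `schema_source` is hypothetical, and a "name" attribute on a primitive schema is the proposal made in this message, not something the Avro spec defines.

```python
import json


def schema_source(schema_json: str) -> str:
    """Return the fully-qualified name of a schema, for use as its source.

    Hypothetical helper: it assumes every schema is written in the "long
    form" with a "name", as proposed in this thread.
    """
    schema = json.loads(schema_json)
    if not isinstance(schema, dict) or "name" not in schema:
        raise ValueError("schema has no name to use as a source")
    name = schema["name"]
    namespace = schema.get("namespace")
    # A dotted name is already fully qualified; otherwise prepend the namespace.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


print(schema_source('{"type":"record","namespace":"org.foo","name":"Bar","fields":[]}'))
print(schema_source('{"type":"string","name":"org.foo.Bar"}'))
```

Both calls yield the same source, which is the point of the proposal: named and unnamed types would be addressed the same way.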

Cheers,

Doug

On Tue, Jul 10, 2012 at 10:53 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> I noticed in AVRO-1006 there was a mention of standardizing on some kind of
> schema repository that would maintain a central set of all versions of a
> schema and allow a way to reference schemas by id.
>
> At LinkedIn we have standardized (almost) all of our persistent data on
> Avro and we have a repository like this for managing schemas. Messages are
> stored with the schema in Hadoop, but for systems that store rows
> independently like databases or messaging we instead store a schema id with
> each row/message. We would love for there to be an open source version of
> this to make it possible to open up our other tools
> for compatibility checking, ETL, and other things that depend on the service.
>
> The service itself is basically a REST service that maintains schemas. Each
> schema has a "source" that it is associated with (the table or messaging
> topic or whatever) and a unique id. Schemas can be fetched by id or you can
> get the latest schema for a given source. Having the notion of sources
> allows us to do two things: (1) enforce a compatibility model on schema
> changes (no backwards incompatible changes for various definitions of
> backwards compatibility), and (2) allow our Hadoop ETL to project all
> messages forward to the latest schema (since AvroFile requires a single
> schema not a per-row schema).
>
> If the Avro project is interested in adopting an official repository that
> would be really nice. It is frankly a pretty trivial piece of code, but
> standardization would allow interoperability between things. I would be
> willing to either open source our repository implementation or do a
> from-scratch one if we come up with more requirements.
>
> -Jay
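
The repository Jay describes in the quoted message (schemas grouped by source, fetched by id or as the latest for a source, with a compatibility check on changes) could look roughly like the in-memory Python sketch below. This is not LinkedIn's implementation. The class, its method names, the integer ids, and the pluggable `is_compatible` callback are all assumptions made for illustration; a real service would put a REST front end and a persistence layer around the same operations.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (old_schema_text, new_schema_text) -> True if the change is allowed.
CompatCheck = Callable[[str, str], bool]


class SchemaRepository:
    """Hypothetical in-memory schema repository keyed by source and id."""

    def __init__(self, is_compatible: Optional[CompatCheck] = None):
        self._by_id: Dict[int, str] = {}
        self._by_source: Dict[str, List[int]] = {}
        # With no check supplied, every change is accepted.
        self._is_compatible = is_compatible or (lambda old, new: True)
        self._next_id = 1

    def register(self, source: str, schema_text: str) -> int:
        """Add a schema version for a source and return its id."""
        versions = self._by_source.setdefault(source, [])
        # Re-registering an existing version returns the id it already has.
        for schema_id in versions:
            if self._by_id[schema_id] == schema_text:
                return schema_id
        if versions and not self._is_compatible(self._by_id[versions[-1]], schema_text):
            raise ValueError(f"incompatible schema change for source {source!r}")
        schema_id = self._next_id
        self._next_id += 1
        self._by_id[schema_id] = schema_text
        versions.append(schema_id)
        return schema_id

    def get(self, schema_id: int) -> str:
        """Fetch a schema by id."""
        return self._by_id[schema_id]

    def latest(self, source: str) -> Tuple[int, str]:
        """Fetch the id and text of the newest schema for a source."""
        schema_id = self._by_source[source][-1]
        return schema_id, self._by_id[schema_id]


repo = SchemaRepository()
first = repo.register("topic.events", '{"type":"record","name":"Event","fields":[]}')
print(first, repo.latest("topic.events"))
```

A writer would store only the returned id with each row or message, and a reader would call `get` with that id, or `latest` when projecting data forward to the newest schema.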