Avro >> mail # dev >> A case for adding revision field to Avro schema

RE: A case for adding revision field to Avro schema
Thanks Doug.

> Where would this union be stored?  Is it only stored in the application,
> or is it stored with the data?  I think it would be safest to somehow
> store it with the dataset, not in the application.

I agree. It should be stored along with the data. Without the schema, the
data is meaningless.

> It sounds like perhaps you're trying to optimize the size of the pointer
> from each stored instance to its schema.  Is that correct?

Not really. We can optimize for size by using the table approach you
mention, or by other means. My motivation is to avoid having the application
interpret the first few bytes and Avro the rest. You capture my intent
precisely in a subsequent paragraph:

> ... The only operation that's simplified
> is that the top-level union dispatch at read and possibly write would
> use Avro logic instead of application logic. ...

The user can have a layer on top of Avro to insert these few bytes during
write and interpret them during read. But my point is that if Avro itself can
be made to do that, it is cleaner and is available to every Avro user.
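
As a rough illustration of the application-level layer I mean, the sketch
below prefixes each already-encoded datum with a revision number on write and
strips it on read. The names `frame` and `unframe` and the 4-byte big-endian
prefix are assumptions made up for this example, not part of Avro; the payload
is treated as opaque bytes that an Avro encoder would have produced.

```python
import struct
from typing import Tuple

# Hypothetical framing: 4-byte big-endian revision number, then the payload.
_PREFIX = struct.Struct(">I")


def frame(revision: int, payload: bytes) -> bytes:
    """Prepend the schema revision to an already-encoded datum."""
    return _PREFIX.pack(revision) + payload


def unframe(data: bytes) -> Tuple[int, bytes]:
    """Split framed bytes back into (revision, payload)."""
    if len(data) < _PREFIX.size:
        raise ValueError("framed datum is shorter than the revision prefix")
    (revision,) = _PREFIX.unpack_from(data)
    return revision, data[_PREFIX.size:]
```

The application would use the revision to look up the writer schema and then
hand the payload to Avro. That split between application logic and Avro logic
is what I would like to avoid.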

> So, if we allowed multiple branches of
> the same name in a top-level union at read time then this might work.


> A way to address this might be through aliases.  If, in the union, each
> branch but the last, the record has a versioned name, i.e., the union is
> ["r0", "r1", .., "r"], then writing would work.  If "r" then has aliases
> of ["r0", "r1", ..], then, at read-time, the union would be rewritten as
> ["r", "r", ...], but where each branch has a different definition.
> Currently this would fail due to the duplicate names, but if we changed
> it so that, in the context of alias rewrites while reading, we
> permit duplicate names in a top-level union, then this could work as
> desired.
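
To make the rewrite concrete, here is a small sketch of the read-time step,
applied to branch names only. The function `rewrite_union_names` and the
dict-based schema shape are hypothetical and do not correspond to any Avro
API; they only show how ["r0", "r1", "r"] becomes ["r", "r", "r"] once "r"
carries the aliases.

```python
def rewrite_union_names(writer_branches, reader_record):
    """Map each writer branch name to the reader's name when it equals the
    reader's name or one of its aliases; leave other names unchanged."""
    known = {reader_record["name"], *reader_record.get("aliases", [])}
    return [
        reader_record["name"] if branch in known else branch
        for branch in writer_branches
    ]


# Reader's record "r" lists the older versioned names as aliases.
reader = {"type": "record", "name": "r", "aliases": ["r0", "r1"], "fields": []}
rewritten = rewrite_union_names(["r0", "r1", "r"], reader)
```

Each rewritten branch would still keep its own field definitions, which is
why the duplicate names would have to be permitted in this context.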

This solves my problem. That is, the new matching rule would be:

When resolving two named schemas, if the schema types are identical (enum,
fixed or record) and the name of the writer schema matches either the name of
the reader schema or one of the reader schema's aliases, we try to match the
contents of the schemas. By contents, I mean the fields for a record, the size
for a fixed, and so on.

Right now, we give up as soon as we realize that the names do not match.
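
The rule above could be sketched roughly as follows. This is only an
illustration under assumed names: `names_match`, `should_compare_contents`
and the plain-dict schemas are invented for the example and are not Avro's
resolution code. The actual content comparison (fields, size, symbols) is
left out.

```python
NAMED_TYPES = {"enum", "fixed", "record"}


def names_match(writer, reader):
    """True if the writer's name is the reader's name or one of its aliases."""
    return (
        writer["name"] == reader["name"]
        or writer["name"] in reader.get("aliases", [])
    )


def should_compare_contents(writer, reader):
    """Proposed rule: same named type, and a name or alias match.
    Only when this holds do we go on to compare the schema contents."""
    return (
        writer["type"] in NAMED_TYPES
        and writer["type"] == reader["type"]
        and names_match(writer, reader)
    )


# A writer schema under an old name, and a reader that lists it as an alias.
old_fixed = {"type": "fixed", "name": "md5_v0", "size": 16}
new_fixed = {"type": "fixed", "name": "md5", "aliases": ["md5_v0"], "size": 16}
```

With these two schemas, `should_compare_contents(old_fixed, new_fixed)` holds
through the alias, whereas a name-only comparison would give up immediately.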

This idea is functionally equivalent to the revision idea, but it is better
because it rides on top of an existing proposal for aliases and does not
introduce a new concept/construct.