Re: A case for adding revision field to Avro schema
At this point, I think the application should manage its own schema store.

The distinction here is fundamental, and it is the main reason why Avro's
serialization is so different from Protocol Buffers' (or Thrift's, which, in
this sense, is interchangeable).

Consider the following two schemas:

// version 1
struct Person {
  string name; // id 1 for PB
}

// version 2
struct Person {
  string name; // id 1 for PB
  int age; // id 2 for PB
}

In Avro, there is an assumption that the schema is stored with the data.
 So, the serialized version of a Person looks like <PersonSchema_v1,
"philip">.  Implementations of storage (like AvroDataFile or the RPC
handshake protocol) do their best to reduce the size of the schema storage.
 The serialized version of an aged Person looks like <PersonSchema_v2,
"philip", 28>.

In Protocol Buffers (or Thrift), the assumption is that the application
knows what schema to read the data with.  The serialized versions look like
<string, 1, "Philip"> and <string, 1, "Philip", int, 2, 28>.  (Both the
types and the "tag numbers" are in the serialized data.)  Storage and RPC
mechanisms keep the tag numbers intact (though they can use lossless
compression).
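For comparison, here is a rough sketch of the PB side using protobuf-java's
low-level CodedOutputStream directly (no generated classes), just to show
that the tag numbers and wire types themselves land in the bytes:

// Sketch: with Protocol Buffers, each field is written as <tag, wire type,
// value>, so the numbers the application assigned end up on the wire.
import java.io.ByteArrayOutputStream;
import com.google.protobuf.CodedOutputStream;

public class TaggedPerson {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    CodedOutputStream out = CodedOutputStream.newInstance(bytes);
    out.writeString(1, "Philip"); // field 1, length-delimited wire type
    out.writeInt32(2, 28);        // field 2, varint wire type
    out.flush();
    // bytes now holds <tag 1, length, "Philip", tag 2, 28>; a reader that
    // only knows about field 1 simply skips the unknown field 2.
  }
}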

Both systems have logic about how to validly "evolve" schemas.  In Avro,
you can add or remove fields, as long as you provide defaults for new ones.
Fields have string names.  In PB, you can add or remove fields, but
re-numbering a field is a big no-no.  In both systems, it's common for
complicated systems to end up carrying several versions of an operation
within their protocols.  (For example, sometimes a system starts with only a
single update operation, and a bulk update operation is added later; the
clients of these systems tend to evolve so that they can degrade gracefully.)
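As a sketch of the Avro rule about defaults, the v2 schema would declare a
default for the new "age" field (the value -1 here is arbitrary); that
default is what lets a reader using v2 resolve data that was written with v1:

// Sketch: "age" gets a default so that v2 readers can resolve v1 data.
import org.apache.avro.Schema;

public class EvolvedSchemas {
  // A GenericDatumReader constructed with v1 as the writer schema and this
  // v2 as the reader schema fills age in as -1 for old records.
  static final Schema PERSON_V2 = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");
}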
That's some background on how the different systems work today.  For data
files, Avro is more efficient, because data files tend to be written all at
once, and therefore work with one schema.  For things like HBase, you end up
storing schema pointers, because your schema evolves over time.
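That is also what "manage its own schema store" means at the top of this
mail: keep the full schemas off to the side, and store only a small pointer
next to each value.  A rough sketch of that idea, where the id scheme is a
naive placeholder rather than any real Avro API:

// Sketch of an application-managed schema store: full schemas live in a map
// keyed by a small id; each stored cell carries only (id, Avro bytes).
// The id scheme here (hashCode of the schema JSON) is a naive stand-in.
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;

public class SchemaStore {
  private final Map<Integer, Schema> byId = new HashMap<Integer, Schema>();

  public int register(Schema schema) {
    int id = schema.toString().hashCode(); // stand-in for a real fingerprint
    byId.put(id, schema);
    return id;                             // store this id next to the value
  }

  public Schema lookup(int id) {
    return byId.get(id);                   // writer schema for decoding the value
  }
}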
I apologize for being dense, but I'm a bit unclear as to what your design
proposes.

>
> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the application
> serializes against a union schema in which the object type's schema is a
> branch. As the application evolves, the application simply adds a branch to
> the union. While reading the object, the application expects one branch, but
> the serialized object might be using another branch. As long as the
> branches
>
What happens if the application doesn't recognize the branch number?  If
you're a client of a get_person(id) call, and you were written when
Person_v1 was the only one in existence, Avro, today, would do just fine at
projecting Person_v2 down into Person_v1 for you.  That's because your
reader schema would be v1, and you'd read some data written with v2, and
those are compatible.  If all you have is a "version id", it's hard to get
that kind of compatibility for old readers reading new data.
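Here is a minimal sketch of that projection with the Avro Java generic API
(record name and values are illustrative): the data is written with v2, and
the old client reads it by passing v2 as the writer schema and v1 as its
reader schema.  (In practice the writer schema comes from the data file
header or the RPC handshake rather than being hard-coded.)

// Sketch: an old (v1) reader projecting away a field added in v2.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ProjectionExample {
  static final Schema V1 = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");
  static final Schema V2 = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

  public static void main(String[] args) throws Exception {
    // The server serializes a Person with the v2 schema.
    GenericRecord aged = new GenericData.Record(V2);
    aged.put("name", "Philip");
    aged.put("age", 28);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(V2).write(aged, enc);
    enc.flush();

    // The old client reads with writer schema v2 and reader schema v1;
    // Avro's schema resolution silently drops the "age" field.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(V2, V1);
    GenericRecord projected = reader.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(projected); // {"name": "Philip"}
  }
}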

> 3. Implementations that do not support revision should ignore it and
> continue to work the way they work today. But I'm not sure what the current
> ones do when they encounter an attribute they don't understand.
>

This never happens.  Any time they read some data, they have the schema it
was written with, so they always encounter attributes that they understand.

I might not be understanding the ultimate use case you're struggling with.
 What case are you trying to make easier?

-- Philip