|
Thiruvalluvan M. G.
2010-09-21, 12:18
Philip Zeyliger
2010-09-21, 17:26
Doug Cutting
2010-09-21, 17:55
Thiruvalluvan M. G.
2010-09-22, 03:00
Thiruvalluvan M. G.
2010-09-22, 03:36
Thiruvalluvan M. G.
2010-09-22, 03:55
Philip Zeyliger
2010-09-22, 06:25
Thiruvalluvan M. G.
2010-09-22, 15:54
Doug Cutting
2010-09-22, 16:00
Scott Carey
2010-09-22, 16:29
|
-
A case for adding revision field to Avro schema
Thiruvalluvan M. G. 2010-09-21, 12:18
Hi all,

Here is a use case. An application stores its objects after serializing them with Avro. The application uses one schema per object type; that is, all objects of a given type share the same schema. To deserialize an object, the application must know the schema for that object. It optimizes storage by storing with each object not the actual schema but a "pointer" to the schema corresponding to the object's type. The pointer itself is stored outside the serialized binary.

As the application evolves, the schemas for some object types change. The application doesn't need to do much if the old and new schemas for an object type "match" as specified in the Avro specification. While loading, it uses the new schema for the object type to deserialize the object. If the object was originally serialized using the old schema, Avro resolves the schemas and the application transparently works as if the object had been serialized using the new schema. While storing the object, it stores the "pointer" to the new schema.

One good thing about this design is that there is no need to do schema migration before a version change. The objects undergo schema change as they get read and written. If for some reason the installation needs to go back to the old version, the objects modified by the new version in the interim will continue to be available, provided the new and old schemas match in the opposite direction as well.

Here is a design that would improve things a bit more. Instead of serializing the object against its actual schema, let's say the application serializes against a union schema in which the object type's schema is a branch. As the application evolves, it simply adds a branch to the union. While reading the object, the application expects one branch but the serialized object might be using another branch. As long as the branches "match", Avro would resolve correctly. The current Java generic writer can correctly pick the branch as long as the object's schema is one of the branches.

The nice thing about this improved design is that there is no need to store a separate schema "pointer" along with the object. The "union-index" essentially acts as the pointer and it is internal to Avro.

But there is one problem. As per the Avro specification, in order to "match", two schemas of the same type should have the same name. But two schemas with the same type and name cannot be branches within a union. Thus the design above will not work. If we modify the spec as follows, it would work:

1. Add a new optional string attribute called "revision" to all named schemas (record, enum, fixed).
2. Allow branches within a union for schemas of the same type and name, provided they both have revisions and the revisions are different. (Not having a revision attribute may be treated as having a null revision; but I'd rather be less permissive here.)
3. Schemas match as per the current matching rules, even if the revisions do not match.
4. While writing, the implementation should choose the branch that matches the type, name and revision.

Caveats:

1. Though we can avoid storing the "pointer" to the schema with each object, the application should somehow figure out the type of the object so as to associate it with the right union schema. We do not allow unions of unions. The application can "flatten" all the unions for all the object types. It's not pretty. The application need not resort to this if it can somehow associate the object type with the object (e.g. from the location of the serialized binary).
2. I don't know the implications of this change in the spec for implementations in languages other than Java.
3. Implementations that do not support revision should ignore it and continue to work the way they work today. But I'm not sure what the current ones do when they encounter an attribute they don't understand.

What do you think?

Regards
Thiru
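Editorial sketch of proposed rules 2 and 4, assuming schemas are held as plain dictionaries. The "revision" attribute is the one proposed in this message and is not part of the Avro specification; the helper names `branch_key`, `validate_union` and `pick_branch` are hypothetical and do not correspond to any Avro API.

```python
# Hypothetical sketch: "revision" is the attribute proposed in this thread,
# not an attribute defined by the Avro specification.

def branch_key(schema):
    """Identify a named schema by (type, name, revision)."""
    return (schema["type"], schema["name"], schema.get("revision"))


def validate_union(branches):
    """Proposed rule 2: branches may share type and name only if every such
    branch has a revision and the revisions are all different."""
    seen = set()
    for branch in branches:
        type_, name, revision = branch_key(branch)
        for seen_type, seen_name, seen_revision in seen:
            if (seen_type, seen_name) != (type_, name):
                continue
            if revision is None or seen_revision is None:
                raise ValueError("duplicate name without revision: " + name)
            if revision == seen_revision:
                raise ValueError("duplicate revision for " + name)
        seen.add((type_, name, revision))


def pick_branch(branches, datum_schema):
    """Proposed rule 4: the writer chooses the branch whose type, name and
    revision all equal those of the datum's schema."""
    key = branch_key(datum_schema)
    for index, branch in enumerate(branches):
        if branch_key(branch) == key:
            return index
    raise ValueError("no branch for %r" % (key,))


PERSON_V1 = {
    "type": "record", "name": "Person", "revision": "1",
    "fields": [{"name": "name", "type": "string"}],
}
PERSON_V2 = {
    "type": "record", "name": "Person", "revision": "2",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": 0},
    ],
}
UNION = [PERSON_V1, PERSON_V2]

validate_union(UNION)                      # accepted: revisions differ
branch_index = pick_branch(UNION, PERSON_V2)
```

Under these assumptions the index returned by `pick_branch` plays the role of the schema "pointer" described above.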
-
Re: A case for adding revision field to Avro schema
Philip Zeyliger 2010-09-21, 17:26
At this point, I think the application should manage its own schema store.

The distinction here is very fundamental and is the main reason why Avro's serialization is fundamentally different from Protocol Buffers (or Thrift; in this sense the two are interchangeable). Consider the following two schemas:

  // version 1
  struct Person {
    string name; // id 1, for PB
  }

  // version 2
  struct Person {
    string name; // id 1 for PB
    int age; // id 2 for PB
  }

In Avro, there is an assumption that the schema is stored with the data. So, the serialized version of a Person looks like <PersonSchema_v1, "philip">. Implementations of storage (like AvroDataFile or the RPC handshake protocol) do their best to reduce the size of the schema storage. The serialized version of an aged Person looks like <PersonSchema_v2, "philip", 28>.

In Protocol Buffers (or Thrift), the assumption is that the application knows what schema to read the data with. The serialized versions look like <string, 1, "Philip"> and <string, 1, "Philip", int, 2, 28>. (Both the types and the "tag numbers" are in the serialized data.) Storage and RPC mechanisms keep the tag numbers intact (though they can use lossless compression).

Both systems have logic about how to validly "evolve" schemas. In Avro, you can add or remove fields, as long as you provide defaults for new ones. Fields have string names. In PBs, you can add or remove fields, but re-numbering a field is a big no-no.

In both systems, it's not atypical for complicated systems to end up evolving various versions inside of the protocols. (For example, sometimes a system will start with only a single update operation, and then there's a bulk update operation; the clients of these systems tend to evolve so that they can degrade gracefully.)

That's some background on how the different systems work today. For data files, Avro is more efficient, because data files tend to be written all at once, and therefore work with one schema. For things like HBase, you end up storing schema pointers, because your schema evolves over time.

I apologize for being dense, but I'm a bit unclear as to what your design proposes.

> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the application
> serializes against a union schema in which the object type's schema is a
> branch. As the application evolves, the application simply adds a branch to
> union. While reading the object, the application expects for one branch but
> the serialized object might be using another branch. As long as the
> branches

What happens if the application doesn't recognize the branch number? If you're a client of a get_person(id) call, and you were written when Person_v1 was the only one in existence, Avro, today, would do just fine at projecting Person_v2 down into Person_v1 for you. That's because your reader schema would be v1, and you'd read some data written with v2, and those are compatible. If you have a "version id", then it's hard to handle compatibility for old readers reading new data.

> 3. Implementations that do not support revision, should ignore it and
> continue to work the way they work today. But I'm not sure what the current
> ones do when they encounter an attribute they don't understand.

This never happens. Any time they read some data, they have the schema it was written with, so they always encounter attributes that they understand.

I might not be understanding the ultimate use case you're struggling with. What case are you trying to make easier?

-- Philip
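Editorial sketch of the reader/writer projection described above, as a toy model over dictionaries. It is not the Avro library: real resolution works on the binary encoding and covers many more cases. The function name `resolve_record` and the default value -1 for `age` are assumptions made for illustration.

```python
# Toy model of record resolution between a writer schema and a reader schema.
# Only record fields are handled; names and defaults here are illustrative.

def resolve_record(writer_schema, reader_schema, written):
    """Project a record written with writer_schema onto reader_schema."""
    if writer_schema["name"] != reader_schema["name"]:
        raise ValueError("record names do not match")
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: take the written value.
            result[name] = written[name]
        elif "default" in field:
            # Field only the reader knows: fall back to its default.
            result[name] = field["default"]
        else:
            raise ValueError("no value or default for field " + name)
    # Fields only the writer knows are dropped.
    return result


PERSON_V1 = {
    "type": "record", "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}
PERSON_V2 = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},  # assumed default
    ],
}

# An old reader (v1) sees new data (v2) with the extra field dropped.
old_view = resolve_record(PERSON_V2, PERSON_V1, {"name": "philip", "age": 28})
# A new reader (v2) sees old data (v1) with the default filled in.
new_view = resolve_record(PERSON_V1, PERSON_V2, {"name": "philip"})
```

The point of the sketch is that both directions work only because the reader has the writer's schema in hand at read time.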
-
Re: A case for adding revision field to Avro schema
Doug Cutting 2010-09-21, 17:55
On 09/21/2010 05:18 AM, Thiruvalluvan M. G. wrote:
> Here is a design that would improve things a bit more. Instead of
> serializing the object against its actual schema, let's say the application
> serializes against a union schema in which the object type's schema is a
> branch. As the application evolves, the application simply adds a branch to
> union.

Where would this union be stored? Is it only stored in the application, or is it stored with the data? I think it would be safest to somehow store it with the dataset, not in the application.

> While reading the object, the application expects for one branch but
> the serialized object might be using another branch. As long as the branches
> "match", Avro would resolve correctly. The current Java generic writer can
> correctly pick the branch as long as the object's schema is one of the
> branches. The nice thing about this improved design is that, there is no
> need to store a separate schema "pointer" along with the object. The
> "union-index" essentially acts as the pointer and it is internal to Avro.

It sounds like perhaps you're trying to optimize the size of the pointer from each stored instance to its schema. Is that correct? If so, then one might simply use a table for this. The application stores <pointer,record> pairs, but pointers need not be 16-byte checksums; they could be variable-length integers, starting from zero, that, for most applications, would always fit in a single byte.

If schemas are stored with the dataset, then they could be stored as either:

- the standalone single schema for every item in the dataset, which happens to be a union schema that's managed in a particular way, adding a new entry to the end each time an instance of a new schema is written; or
- a table of schemas, whose indices are used as pointers in each datum, with entries added when no existing entry matches a datum to be written.

The two are isomorphic. The former uses more Avro logic but feels more fragile. It's not really an arbitrary schema, but a union that takes advantage of the way that unions are serialized. The latter feels to me like a clearer description of a dataset.

In either case the application must manage the table of schemas. The only operation that's simplified is that the top-level union dispatch at read and possibly write would use Avro logic instead of application logic. At write you might even be tempted to bypass Avro logic, since, in maintaining the union, you'd know the branch already, and searching for the right branch might be more costly.

> But there is one problem. As per the Avro specification, in order to "match"
> two schemas of the same type should have the same name. But two schemas with
> the same type and name cannot be branches within a union. Thus the design
> above will not work.

The problem with multiple union branches of the same name only arises at write time, not at read time. So, if we allowed multiple branches of the same name in a top-level union at read time then this might work.

A way to address this might be through aliases. If, in each branch of the union but the last, the record has a versioned name, i.e., the union is ["r0", "r1", .., "r"], then writing would work. If "r" then has aliases of ["r0", "r1", ..], then, at read-time, the union would be rewritten as ["r", "r", ...], but where each branch has a different definition. Currently this would fail due to the duplicate names, but if we changed it so that, in the context of alias rewrites while reading, we permit duplicate names in a top-level union, then this could work as desired.

Doug
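Editorial sketch of the "table of schemas" option, assuming the pointer is written with the zig-zag, variable-length integer encoding that Avro's binary format uses for int and long values. The `SchemaTable` class and its methods are hypothetical names for illustration; the payload is treated as opaque bytes.

```python
import json


def encode_long(n):
    """Zig-zag encode n, then emit it as a base-128 varint (low bits first)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data, pos=0):
    """Inverse of encode_long; returns (value, position after the varint)."""
    shift = 0
    z = 0
    while True:
        byte = data[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


class SchemaTable:
    """Hypothetical append-only table; a schema's index is its pointer."""

    def __init__(self):
        self.schemas = []

    def pointer_for(self, schema):
        # Add an entry only when no existing entry matches the schema.
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self.schemas:
            self.schemas.append(canonical)
        return self.schemas.index(canonical)

    def frame(self, schema, payload):
        """Store a <pointer, record> pair as pointer bytes followed by payload."""
        return encode_long(self.pointer_for(schema)) + payload

    def unframe(self, data):
        """Return (writer schema, payload) for a framed datum."""
        index, pos = decode_long(data)
        return json.loads(self.schemas[index]), data[pos:]


table = SchemaTable()
v1 = {"type": "record", "name": "Person",
      "fields": [{"name": "name", "type": "string"}]}
framed = table.frame(v1, b"payload")
schema, payload = table.unframe(framed)
```

With this encoding, pointers 0 through 63 each take one byte, which is consistent with the remark that most applications would need only a single byte.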
-
RE: A case for adding revision field to Avro schema
Thiruvalluvan M. G. 2010-09-22, 03:00
Thanks Philip for your crisp description of what happens with Thrift and PB.
I had assumed that the community knows the difference between those systems and Avro. Your description should help those who don't know and be a refresher for those who do.

> What happens if the application doesn't recognize the branch number? If
> you're a client of a get_person(id) call, and you were written when
> Person_v1 was the only one in existence, Avro, today, would do just fine at
> projecting Person_v2 down into Person_v1 for you. That's because your
> reader schema would be v1, and you'd read some data written with v2, and
> those are compatible. If you have a "version id", then it's hard to go do
> compatibility of old readers reading new data.

My proposal was:

"3. Schemas match as per the current matching rules, even if the revisions do not match."

That is, since Person_v2 and Person_v1 have the same name "Person" and different revisions v2 and v1, they would match according to the current rules.

Thiru
-
RE: A case for adding revision field to Avro schema
Thiruvalluvan M. G. 2010-09-22, 03:36
Thanks Doug.

> Where would this union be stored? Is it only stored in the application,
> or is it stored with the data? I think it would be safest to somehow
> store it with the dataset, not in the application.

I agree. It should be stored along with the data. Without the schema the data is meaningless.

> It sounds like perhaps you're trying to optimize the size of the pointer
> from each stored instance to its schema. Is that correct?

Not really; we can optimize for size by using the table approach you mention, or by other means. My motivation is to avoid having the application interpret the first few bytes and Avro the rest. You capture my intent very precisely in a subsequent paragraph:

> ... The only operation that's simplified
> is that the top-level union dispatch at read and possibly write would
> use Avro logic instead of application logic. ...

The user can have a layer on top of Avro to insert these few bytes during write and interpret them during read. But my point is that if Avro can be made to do that, it is better and is available to every Avro user.

> So, if we allowed multiple branches of
> the same name in a top-level union at read time then this might work.

Exactly.

> A way to address this might be through aliases. If, in the union, each
> branch but the last, the record has a versioned name, i.e., the union is
> ["r0", "r1", .., "r"], then writing would work. If "r" then has aliases
> of ["r0", "r1", ..], then, at read-time, the union would be rewritten as
> ["r", "r", ...], but where each branch has a different definition.
> Currently this would fail due to the duplicate names, but if we changed
> it that so that, in the context of alias rewrites while reading, we
> permit duplicate names in a top-level union, then this could work as
> desired.

This solves my problem. That is, the new matching rule would be:

When resolving two named schemas, if the types of the schemas are identical (enum, fixed or record) and the name of the writer schema matches either the name of the reader schema or one of the reader schema's aliases, we try to match the contents of the schemas. By contents, I mean fields for the record, size for the fixed, etc. Right now, we give up as soon as we realize that the names do not match.

This idea is functionally equivalent to the revision idea, but it is better because it rides on top of an existing proposal for aliases and does not introduce a new concept/construct.

Thanks
Thiru
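Editorial sketch of the name-matching rule stated above, assuming schemas are plain dictionaries with an optional "aliases" list on the reader side. The function names and the field contents of the "r0" and "r" records are hypothetical; only the name check is modeled, not the subsequent matching of contents.

```python
# Sketch of the proposed name check for named schemas (record, enum, fixed).
# Content matching (fields, size, symbols) would follow and is not shown.

def names_match(writer_schema, reader_schema):
    """True if the types are identical and the writer's name equals the
    reader's name or one of the reader's aliases."""
    if writer_schema["type"] != reader_schema["type"]:
        return False
    accepted = {reader_schema["name"], *reader_schema.get("aliases", [])}
    return writer_schema["name"] in accepted


def matching_reader_branches(writer_schema, reader_union):
    """Indices of reader union branches whose name check passes."""
    return [index for index, branch in enumerate(reader_union)
            if names_match(writer_schema, branch)]


# Writer side: an older, versioned name, as in the ["r0", "r1", .., "r"] union.
R0 = {"type": "record", "name": "r0",
      "fields": [{"name": "name", "type": "string"}]}
# Reader side: the current record, carrying the old names as aliases.
R = {"type": "record", "name": "r", "aliases": ["r0", "r1"],
     "fields": [{"name": "name", "type": "string"},
                {"name": "age", "type": "int", "default": 0}]}

old_data_readable = names_match(R0, R)   # alias "r0" is accepted
```

Note that the check is one-directional in this sketch: aliases on the reader widen what it accepts, while the writer is identified by its name alone.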
-
RE: A case for adding revision field to Avro schema
Thiruvalluvan M. G. 2010-09-22, 03:55
> ... but it is better
> because it rides on top of an existing proposal for aliases and does not
> introduce a new concept/construct.

I just noticed that the aliases we have at present are for field names, not for schema names. Can we extend this to schema names as well?

Thanks
Thiru
-
Re: A case for adding revision field to Avro schema
Philip Zeyliger 2010-09-22, 06:25
On Tue, Sep 21, 2010 at 8:00 PM, Thiruvalluvan M. G. <[EMAIL PROTECTED]> wrote:
> Thanks Philip for your crisp description of what happens with Thrift and
> PB.
> I had assumed that the community knows the difference between those systems
> and Avro. your description should help those who don't know and be
> refresher for those who know.
>
> > What happens if the application doesn't recognize the branch number? If
> > you're a client of a get_person(id) call, and you were written when
> > Person_v1 was the only one in existence, Avro, today, would do just fine
> > at projecting Person_v2 down into Person_v1 for you. That's because your
> > reader schema would be v1, and you'd read some data written with v2, and
> > those are compatible. If you have a "version id", then it's hard to go
> > do compatibility of old readers reading new data.
>
> My proposal was:
>
> "3. Schemas match as per the current matching rules, even if the revisions
> do not match."
>
> That is, since Person_v2 and Person_v1 have the same name "Person" and
> different revisions v2 and v1, they would match according to the current
> rules.

I'm beginning to understand your proposal a little bit better. What happens when the revisions aren't linear? (Or do we require them to be?) For example:

  Writer's Schema union:
    Person_a: (name)
    Person_b: (name, age)

  Reader's Schema union:
    Person_c: (age)
    Person_d: (age, school [default=""])

When "Person_b, Philip, 28" is written, what would a subsequent reader see?

I'm worried that the semantics of reader and writer schemas are already complicated enough; adding in sets of schemas makes it even trickier.

-- Philip
-
RE: A case for adding revision field to Avro schema
Thiruvalluvan M. G. 2010-09-22, 15:54
> ... What happens
> when the revisions aren't linear? (Or do we require them to be?)

I was not considering readers and writers being completely different software. My use case was two versions of a single application writing and reading its objects. Non-linear revisions for persistence aren't common within a single application.

> I'm worried that the semantics of reader and writer schemas are already
> complicated enough; adding in sets of schemas makes it even trickier.

I understand your concern. I agree the union schema could become really complicated over time, say after 5 revisions. We'll have to carry all five revisions even if we know that nobody needs all five revisions at any given time.

Given this, let me work with the "external" schema-id idea and gain some experience and then come back with a proposal.

For now, let me withdraw my proposal.

Thank you and Doug for the valuable feedback.

Thiru
-
Re: A case for adding revision field to Avro schema
Doug Cutting 2010-09-22, 16:00
On 09/21/2010 08:55 PM, Thiruvalluvan M. G. wrote:
> I just noticed that the aliases we have presently is for field names not for
> schema names. Can we extend this to schema names as well?

Aliases are implemented for schema names. See TestSchema#testAliases().

http://avro.apache.org/docs/current/spec.html#Aliases

Doug
-
Re: A case for adding revision field to Avro schema
Scott Carey 2010-09-22, 16:29
On Sep 22, 2010, at 8:54 AM, Thiruvalluvan M. G. wrote:
>> I'm worried that the semantics of reader and writer schemas are already
>> complicated enough; adding in sets of schemas makes it even trickier.
>
> I understand your concern. I agree the union schema could become really
> complicated over time, say after 5 revisions. We'll have to carry all the
> five revisions even if know that nobody needs all the fiver revisions at any
> given time.

I'd also worry about the size. My schemas are 1k to 8k in JSON size. Keeping a copy of each revision in union branches is something I'd avoid with any larger schema.

I'm fine with storing the schema as metadata with my data for most uses. There are a few tricky ones where something like your proposal would be helpful, but I'm not sure overloading Unions for it is the right thing.

For example, what if you want to serialize data into a browser cookie using Avro? There is no 'store the schema with the data' option here, period. You have to be able to identify what schema was used via a version identifier. The application can manage that, or Avro can. At minimum we should strive for documentation and advice on the issue.

> Given this, let me work with the "external" schema-id idea and gain some
> experience and then come back with a proposal.
>
> For now, let me withdraw my proposal.
>
> Thank you and Doug for the valuable feedback.
>
> Thiru