-
Nested schema issue (with "munged" invalid schema)
Peter Cameron 2012-05-01, 16:55
I'm having a problem with nesting schemas. A very brief overview of why we're using Avro (successfully so far) is:
o code generation not required
o small binary format
o dynamic use of schemas at runtime
We're doing a flavour of RPC, and the reason we're not using Avro's IDL and flavour of RPC is because the endpoint is not necessarily a Java platform (C# and Java for our purposes), and only the Java implementation of Avro has RPC. Hence no Avro RPC for us.
I'm aware that Avro doesn't import nested schemas out of the box. We need that functionality as we're exposed to schemas over which we have no control, and in the interests of maintainability, these schemas are nicely partitioned and are referenced as types from within other schemas. So, for example, an address schema refers to a some.domain.location object by having a field of type "some.domain.location". Note that our runtime has no knowledge of any some.domain package (e.g. address or location objects). Only the endpoints know about some.domain. (A layer at our endpoint runtime serialises any unknown, i.e. non-primitive, objects as bytestreams.)
I implemented a schema cache which intelligently imports schemas on the fly, so adding the address schema to the cache automatically adds the location schema that it refers to. The cache uses Avro's schema parser to parse an added schema, catches parse exceptions, looks at the exception message to see whether or not the error is due to a missing or undefined type, and if so goes off to import the needed schema. Brittle, I know, but no other way for us. We need this functionality, and nothing else comes close to Avro.
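One alternative to inspecting exception messages is to walk the schema JSON directly and collect every named type it refers to but does not define. The sketch below is only an illustration, not the cache described above: `missing_types`, `load_with_dependencies` and the `registry` dict (full type name mapped to schema JSON text) are hypothetical names, and it assumes unqualified references resolve against the enclosing namespace.

```python
import json

# Avro's primitive type names.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_KINDS = {"record", "enum", "fixed"}


def fullname(name, namespace):
    """Qualify a name with its namespace unless it is already dotted."""
    return name if "." in name or not namespace else f"{namespace}.{name}"


def missing_types(schema, namespace=None, defined=None):
    """Return the names a schema refers to without defining them first."""
    defined = set() if defined is None else defined
    missing = set()
    if isinstance(schema, str):
        if schema not in PRIMITIVES:
            ref = fullname(schema, namespace)
            if ref not in defined:
                missing.add(ref)
    elif isinstance(schema, list):  # a union: check every branch
        for branch in schema:
            missing |= missing_types(branch, namespace, defined)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind in NAMED_KINDS:
            name = schema["name"]
            if "." in name:
                namespace = name.rpartition(".")[0]
            else:
                namespace = schema.get("namespace", namespace)
            defined.add(fullname(name, namespace))
            for field in schema.get("fields", []):
                missing |= missing_types(field["type"], namespace, defined)
        elif kind == "array":
            missing |= missing_types(schema["items"], namespace, defined)
        elif kind == "map":
            missing |= missing_types(schema["values"], namespace, defined)
        else:  # e.g. {"type": "string"} or a nested type object
            missing |= missing_types(kind, namespace, defined)
    return missing


def load_with_dependencies(name, registry, cache):
    """Add `name`, and every schema it refers to, to `cache`.

    `registry` is a hypothetical lookup from full type name to JSON text.
    """
    if name in cache:
        return
    schema = json.loads(registry[name])
    cache[name] = schema  # recorded before recursing, so cycles terminate
    for dep in sorted(missing_types(schema)):
        load_with_dependencies(dep, registry, cache)
```

Called with "some.domain.address" and a registry holding both schemas from this thread, the cache ends up containing the address and location schemas.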
So far so good, until today when I hit a corner case.
Say I have an address object that has two fields, called position1 and position2. If position1 and position2 are non-primitive types, then the address schema doesn't parse so presumably is an invalid Avro schema. The error concerns redefining the location type. Here's the example:
location schema
===============

{
  "name": "location",
  "type": "record",
  "namespace": "some.domain",
  "fields": [
    { "name": "latitude", "type": "float" },
    { "name": "longitude", "type": "float" }
  ]
}

address schema
==============

{
  "name": "address",
  "type": "record",
  "namespace": "some.domain",
  "fields": [
    { "name": "street", "type": "string" },
    { "name": "city", "type": "string" },
    { "name": "position1", "type": "some.domain.location" },
    { "name": "position2", "type": "some.domain.location" }
  ]
}

Now, an answer of having a list of positions as a field is not an answer for us, as we need to solve the general issue of a schema with more than one instance of the same nested type, i.e. my problem is not with an address or location schema.
The problematic schema constructed by my schema cache is:
{
  "name": "address2",
  "type": "record",
  "namespace": "some.domain",
  "fields": [
    { "name": "street", "type": "string" },
    { "name": "city", "type": "string" },
    {
      "name": "position1",
      "type": {
        "type": "record",
        "name": "location",
        "namespace": "some.domain",
        "fields": [
          { "name": "latitude", "type": "float" },
          { "name": "longitude", "type": "float" }
        ]
      }
    },
    {
      "name": "position2",
      "type": {
        "type": "record",
        "name": "location",
        "namespace": "some.domain",
        "fields": [
          { "name": "latitude", "type": "float" },
          { "name": "longitude", "type": "float" }
        ]
      }
    }
  ]
}

Can this be done? This is potentially a blocker for us.
cheers, Peter
-
Re: Nested schema issue (with "munged" invalid schema)
Scott Carey 2012-05-01, 23:24
The second time that "location" is used, it should be used by reference, and not re-defined. I believe that "name":"position2" "type":"some.domain.location" should work, provided the type "some.domain.location" is defined previously in the schema, as it is in "position1".
-
Re: Nested schema issue (with "munged" invalid schema)
Peter Cameron 2012-05-02, 09:26
On 02/05/2012 00:24, Scott Carey wrote:
> The second time that "location" is used, it should be used by
> reference, and not re-defined. I believe that
> "name":"position2"
> "type":"some.domain.location" should work, provided the type
> "some.domain.location" is defined previously in the schema, as it is
> in "position1".
Thanks, that did the job. Obvious, I suppose when you think about it!
We're attempting to use Avro to define some specifications that we're putting forward (as a set of Avro schemas) to a standards body on which we have a presence.
With Avro as it stands right now, that specification would consist of a set of schemas plus a layer that you have to implement on top of Avro to manage schemas. This isn't ideal. Code (in Java, C# or whatever) should not form part of our spec. I like the idea of the parser having callbacks on specific events, such as "type not defined". That would provide a lot of what we need, but not all.
For our particular scenarios, we don't have access to the domain objects defined by the schemas at runtime. In other words, we are entirely schema-driven (apart from some code generation for our core functionality). Class types of objects at runtime are not something we're interested in -- the actual type defined in the schema is. So, for the location example, we don't have a location object; we have a location schema and therefore its QName, i.e. some.domain.location.
How we proceed in our thin veneer on top of Avro is to serialise any non-primitive (i.e. not one of Avro's built-in types), as an array of bytes. The type information (some.domain.location) is also serialised as part of our schema, so all the information is there to reconstruct a location object at the *endpoint*.
Given that the schema is all there is for us, we've also had to custom-code a type for collections e.g. a list of locations is typed as "list<some.domain.location>".
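For comparison, Avro's own collection type is the array schema, written {"type": "array", "items": ...}. The sketch below maps a custom type string of the form "list<T>" onto that built-in form. It is only an illustration: `to_avro_type` is a hypothetical helper, and the "list<T>" syntax is assumed from the single example in this message.

```python
import re

# Matches the assumed custom syntax "list<T>"; T may itself be a list<...>.
_LIST = re.compile(r"^list<(.+)>$")


def to_avro_type(type_name):
    """Translate 'list<T>' into an Avro array schema; pass other names through."""
    type_name = type_name.strip()
    match = _LIST.match(type_name)
    if match:
        return {"type": "array", "items": to_avro_type(match.group(1))}
    return type_name
```

A name that is not wrapped in "list<...>" is returned unchanged, so the helper can be applied to every field type uniformly.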
Any comments or thoughts?
Peter
-
Re: Nested schema issue (with "munged" invalid schema)
Nick Palmer 2012-05-30, 21:14
You cannot define the same type twice within the same schema so you need to change your "munge" step to produce the following:
{
  "name": "address2",
  "type": "record",
  "namespace": "some.domain",
  "fields": [
    { "name": "street", "type": "string" },
    { "name": "city", "type": "string" },
    {
      "name": "position1",
      "type": {
        "type": "record",
        "name": "location",
        "namespace": "some.domain",
        "fields": [
          { "name": "latitude", "type": "float" },
          { "name": "longitude", "type": "float" }
        ]
      }
    },
    { "name": "position2", "type": "some.domain.location" }
  ]
}
~ Nick
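The rule described in this thread (define a named type at its first use, refer to it by name afterwards) can be sketched as a small "munge" step over the schema JSON. This is a minimal illustration under stated assumptions, not the poster's cache: `inline_once` and the `registry` dict are hypothetical, references are assumed to be written as full names such as "some.domain.location", and namespace inheritance for nested types is ignored.

```python
def inline_once(schema, registry, seen=None):
    """Inline the first reference to each registry type; keep later ones as names.

    `registry` is a hypothetical dict of full type name -> parsed schema.
    The input schema and the registry entries are left unmodified.
    """
    seen = set() if seen is None else seen
    if isinstance(schema, str):
        if schema in registry and schema not in seen:
            seen.add(schema)  # marked first, so self-references stay as names
            return inline_once(registry[schema], registry, seen)
        return schema  # primitive, unknown, or already-defined name
    if isinstance(schema, list):  # a union: process each branch in order
        return [inline_once(branch, registry, seen) for branch in schema]
    out = dict(schema)  # shallow copy; nested parts are rebuilt below
    kind = out.get("type")
    if kind == "record":
        namespace = out.get("namespace")
        seen.add(f"{namespace}.{out['name']}" if namespace else out["name"])
        out["fields"] = [
            {**field, "type": inline_once(field["type"], registry, seen)}
            for field in out["fields"]
        ]
    elif kind == "array":
        out["items"] = inline_once(out["items"], registry, seen)
    elif kind == "map":
        out["values"] = inline_once(out["values"], registry, seen)
    return out
```

Applied to the address schema with the location schema in the registry, position1 receives the full record definition and position2 stays as the string "some.domain.location".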