Avro, mail # user - Nested schema issue (with "munged" invalid schema)


Peter Cameron 2012-05-01, 16:55
Re: Nested schema issue (with "munged" invalid schema)
Scott Carey 2012-05-01, 23:24


On 5/1/12 9:55 AM, "Peter Cameron" <[EMAIL PROTECTED]> wrote:

> I'm having a problem with nesting schemas. A very brief overview of why we're
> using Avro (successfully so far) is:
>
> o code generation not required
> o small binary format
> o dynamic use of schemas at runtime
>
> We're doing a flavour of RPC, and the reason we're not using Avro's IDL and
> flavour of RPC is because the endpoint is not necessarily a Java platform (C#
> and Java for our purposes), and only the Java implementation of Avro has RPC.
> Hence no Avro RPC for us.
>
> I'm aware that Avro doesn't import nested schemas out of the box. We need that
> functionality as we're exposed to schemas over which we have no control, and
> in the interests of maintainability, these schemas are nicely partitioned and
> are referenced as types from within other schemas. So, for example, an address
> schema refers to a some.domain.location object by having a field of type
> "some.domain.location". Note that our runtime has no knowledge of any
> some.domain package (e.g. address or location objects). Only the endpoints
> know about some.domain. (A layer at our endpoint runtime serialises any
> unknown, i.e. non-primitive, objects as bytestreams.)
>
> I implemented a schema cache which intelligently imports schemas on the fly,
> so adding the address schema to the cache, automatically adds the location
> schema that it refers to. The cache uses Avro's schema parser to parse an added
> schema, catches parse exceptions, looks at the exception message to see
> whether or not the error is due to a missing or undefined type, and thus goes
> off to import the needed schema. Brittle, I know, but no other way for us. We
> need this functionality, and nothing else comes close to Avro.
>
> So far so good, until today when I hit a corner case.
>
> Say I have an address object that has two fields, called position1 and
> position2. If position1 and position2 are non-primitive types, then the
> address schema doesn't parse so presumably is an invalid Avro schema. The
> error concerns redefining the location type. Here's the example:
>
> location schema
> ==============
>
> {
>     "name": "location",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "latitude",
>             "type": "float"
>         },
>         {
>             "name": "longitude",
>             "type": "float"
>         }
>     ]
> }
>
> address schema
> ==============
>
> {
>     "name": "address",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "street",
>             "type": "string"
>         },
>         {
>             "name": "city",
>             "type": "string"
>         },
>         {
>             "name": "position1",
>             "type": "some.domain.location"
>         },
>         {
>             "name": "position2",
>             "type": "some.domain.location"
>         }
>     ]
> }
>
>
> Now, an answer of having a list of positions as a field is not an answer for
> us, as we need to solve the general issue of a schema with more than one
> instance of the same nested type, i.e. my problem is not specific to an
> address or location schema.
>
> The problematic schema constructed by my schema cache is:
>
> {
>     "name": "address2",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "street",
>             "type": "string"
>         },
>         {
>             "name": "city",
>             "type": "string"
>         },
>         {
>             "name": "position1",
>             "type": {"type":"record","name":"location","namespace":"some.domain",
>                      "fields":[{"name":"latitude","type":"float"},
>                                {"name":"longitude","type":"float"}]}
>         },
>         {
>             "name": "position2",
>             "type":
> {"type":"record","name":"location","namespace":"some.domain","fields":[{"name"

The second time that "location" is used, it should be used by reference, and
not re-defined.  I believe that
  "name":"position2",
  "type":"some.domain.location"
should work, provided the type "some.domain.location" is defined previously
in the schema, as it is in "position1".
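
A minimal sketch of that define-once, reference-afterwards approach, for a cache
that builds a combined schema by inlining. It uses only Python's standard json
and copy modules and does not call Avro itself. The helper names (fullname,
inline_once, build_self_contained) are hypothetical, it assumes references are
written as fully qualified names as in the address example, and it only walks
records, arrays, maps and unions.

```python
import copy
import json


def fullname(schema):
    """Return the namespace-qualified name of a named schema (hypothetical helper)."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def inline_once(node, registry, defined):
    """Inline a referenced type the first time it is seen; keep later uses as names."""
    if isinstance(node, str):
        if node in registry and node not in defined:
            # First use: substitute a copy of the full definition.
            return inline_once(copy.deepcopy(registry[node]), registry, defined)
        # Primitive, unknown, or already defined earlier: leave the name as-is.
        return node
    if isinstance(node, list):
        # A union: resolve each branch in order.
        return [inline_once(branch, registry, defined) for branch in node]
    if isinstance(node, dict):
        kind = node.get("type")
        if kind == "record":
            # Mark the record as defined before visiting its fields.
            defined.add(fullname(node))
            for field in node["fields"]:
                field["type"] = inline_once(field["type"], registry, defined)
        elif kind == "array":
            node["items"] = inline_once(node["items"], registry, defined)
        elif kind == "map":
            node["values"] = inline_once(node["values"], registry, defined)
    return node


def build_self_contained(root, registry):
    """Return a copy of root with each registry type defined at most once."""
    return inline_once(copy.deepcopy(root), registry, set())


LOCATION = {
    "name": "location",
    "type": "record",
    "namespace": "some.domain",
    "fields": [
        {"name": "latitude", "type": "float"},
        {"name": "longitude", "type": "float"},
    ],
}

ADDRESS = {
    "name": "address",
    "type": "record",
    "namespace": "some.domain",
    "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "position1", "type": "some.domain.location"},
        {"name": "position2", "type": "some.domain.location"},
    ],
}

registry = {fullname(LOCATION): LOCATION}
combined = build_self_contained(ADDRESS, registry)
print(json.dumps(combined, indent=2))
```

In the printed schema, position1 carries the full location definition and
position2 stays as the string "some.domain.location". Whether a particular Avro
implementation accepts the result should still be confirmed with that
implementation's own parser.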

Peter Cameron 2012-05-02, 09:26
Nick Palmer 2012-05-30, 21:14