Avro user mailing list: Nested schema issue (with "munged" invalid schema)


Peter Cameron 2012-05-01, 16:55
Re: Nested schema issue (with "munged" invalid schema)


On 5/1/12 9:55 AM, "Peter Cameron" <[EMAIL PROTECTED]> wrote:

> I'm having a problem with nesting schemas. A very brief overview of why we're
> using Avro (successfully so far) is:
>
> o code generation not required
> o small binary format
> o dynamic use of schemas at runtime
>
> We're doing a flavour of RPC, and the reason we're not using Avro's IDL and
> its flavour of RPC is that the endpoint is not necessarily a Java platform (C#
> and Java for our purposes), and only the Java implementation of Avro has RPC.
> Hence no Avro RPC for us.
>
> I'm aware that Avro doesn't import nested schemas out of the box. We need that
> functionality as we're exposed to schemas over which we have no control, and
> in the interests of maintainability, these schemas are nicely partitioned and
> are referenced as types from within other schemas. So, for example, an address
> schema refers to a some.domain.location object by having a field of type
> "some.domain.location". Note that our runtime has no knowledge of any
> some.domain package (e.g. address or location objects). Only the endpoints
> know about some.domain. (A layer at our endpoint runtime serialises any
> unknown i.e. non-primitive objects as bytestreams.)
>
> I implemented a schema cache which intelligently imports schemas on the fly,
> so adding the address schema to the cache automatically adds the location
> schema that it refers to. The cache uses Avro's schema parser to parse an
> added schema, catches parse exceptions, looks at the exception message to see
> whether or not the error is due to a missing or undefined type, and if so goes
> off to import the needed schema. Brittle, I know, but there is no other way
> for us. We need this functionality, and nothing else comes close to Avro.
>
> So far so good, until today when I hit a corner case.
>
> Say I have an address object that has two fields, called position1 and
> position2. If position1 and position2 are non-primitive types, then the
> address schema doesn't parse, so presumably it is an invalid Avro schema. The
> error concerns redefining the location type. Here's the example:
>
> location schema
> ==============
>
> {
>     "name": "location",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "latitude",
>             "type": "float"
>         },
>         {
>             "name": "longitude",
>             "type": "float"
>         }
>     ]
> }
>
> address schema
> ==============
>
> {
>     "name": "address",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "street",
>             "type": "string"
>         },
>         {
>             "name": "city",
>             "type": "string"
>         },
>         {
>             "name": "position1",
>             "type": "some.domain.location"
>         },
>         {
>             "name": "position2",
>             "type": "some.domain.location"
>         }
>     ]
> }
>
>
> Now, making the positions a list field is not an answer for us, as we need to
> solve the general issue of a schema with more than one instance of the same
> nested type, i.e. my problem is not specifically with an address or location
> schema.
>
> The problematic schema constructed by my schema cache is:
>
> {
>     "name": "address2",
>     "type": "record",
>     "namespace" : "some.domain",
>     "fields" :
>     [
>         {
>             "name": "street",
>             "type": "string"
>         },
>         {
>             "name": "city",
>             "type": "string"
>         },
>         {
>             "name": "position1",
>             "type":
> {"type":"record","name":"location","namespace":"some.domain","fields":[{"name":"latitude","type":"float"},{"name":"longitude","type":"float"}]}
>         },
>         {
>             "name": "position2",
>             "type":
> {"type":"record","name":"location","namespace":"some.domain","fields":[{"name":"latitude","type":"float"},{"name":"longitude","type":"float"}]}
>         }
>     ]
> }
The second time that "location" is used, it should be used by reference, and
not re-defined.  I believe that

  "name": "position2",
  "type": "some.domain.location"

should work, provided the type "some.domain.location" is defined previously in
the schema, as it is in "position1".

Peter Cameron 2012-05-02, 09:26
Nick Palmer 2012-05-30, 21:14