Avro, mail # dev - Field level reference across Avro Schema


Re: Field level reference across Avro Schema
Sean Busbey 2014-03-19, 06:56
On Tue, Mar 18, 2014 at 11:31 PM, hiteshpahuja <[EMAIL PROTECTED]> wrote:
Ah, yes. Currently the Avro specification only allows the type of a field
to be a named type or a schema. ATM, named types are only Record, Enum, and
Fixed[1].

That does mean that if one of the particular fields of your CommonData is
itself a named type you could reference it, but the usage is awkward.

Expanding named types to include record fields would be an incompatible
change, because it might cause existing schemas to break. Specifically, if
a schema had a field that had the same name as some other named type in the
same namespace the collision would result in an error. If this is something
you want to work out the details on, you should file a jira.
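
To make the collision concrete, here is a minimal sketch in plain Python (no Avro library). It assumes a hypothetical rule under which a record field would get the fullname "namespace.fieldName", and it reports fields whose fullname is already taken by a record, enum, or fixed type. The Order schema and the collect_names helper are made up for illustration; neither is part of Avro.

```python
import json

NAMED_TYPES = {"record", "enum", "fixed"}


def fullname(name, namespace):
    """Qualify a simple name with its namespace, if it has one."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def collect_names(schema, enclosing_ns=None, types=None, fields=None):
    """Gather fullnames of named types, plus the fullnames record fields
    WOULD get under the hypothetical 'fields are named types' rule."""
    types = set() if types is None else types
    fields = set() if fields is None else fields
    if isinstance(schema, list):  # union: visit every branch
        for branch in schema:
            collect_names(branch, enclosing_ns, types, fields)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if isinstance(kind, str) and kind in NAMED_TYPES:
            full = fullname(schema["name"],
                            schema.get("namespace", enclosing_ns))
            types.add(full)
            # Nested names inherit the namespace of the enclosing type.
            ns = full.rpartition(".")[0] or None
            for field in schema.get("fields", []):  # only records have fields
                fields.add(fullname(field["name"], ns))
                collect_names(field["type"], ns, types, fields)
        elif kind == "array":
            collect_names(schema["items"], enclosing_ns, types, fields)
        elif kind == "map":
            collect_names(schema["values"], enclosing_ns, types, fields)
    return types, fields


# Hypothetical schema: valid today, but the field "Status" shares its
# name with the enum "Status" in the same namespace.
schema = json.loads("""
{"namespace": "recordData", "type": "record", "name": "Order",
 "fields": [
   {"name": "recordId", "type": "string"},
   {"name": "Status",
    "type": {"type": "enum", "name": "Status", "symbols": ["OPEN", "CLOSED"]}}
 ]}
""")

types, fields = collect_names(schema)
collisions = sorted(types & fields)
print(collisions)  # ['recordData.Status']
```

A schema like this parses fine under the current spec, which is why making fields referenceable by name would not be backwards compatible.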

There are a few things you could do now, but the one I'd recommend is to
rely on alias support.

For example, given these two customized records:

{"namespace": "recordData",
 "type": "record",
 "name": "CustomizedRecordDataFoo",
 "fields": [
     {"name": "recordId", "type": "string"},
     {"name":  "foo",  "type": ["string", "null"]}
     ]
}

{"namespace": "recordData",
 "type": "record",
 "name": "CustomizedRecordDataBar",
 "fields": [
     {"name": "bar", "type": "string"},
     {"name": "recordDate",  "type": "string"}
     ]
}

and then, when you want to work with the common fields, you define a reader schema

{"namespace": "recordData",
  "type": "record",
  "name": "CommonData",
  "aliases": ["CustomizedRecordDataFoo", "CustomizedRecordDataBar"],
  "fields" : [
     {"name": "recordId", "type": ["null", "string"], "default": null},
     {"name": "recordDate",  "type": ["null", "string"], "default": null},
     {"name": "recordPrice", "type": ["null", "int"], "default": null},
     {"name": "customer", "type": ["null", "string"], "default": null}
  ]
}

Using that reader schema should allow you to go over records of both
customized versions, with whichever fields are present being set.
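
As a rough illustration of what that resolution does, the sketch below models it in plain Python rather than calling the Avro library. It only covers the parts used here (matching the record by name or alias, matching fields by name, filling in reader defaults) and ignores namespaces and type promotion. The resolve_record helper and the sample datum values are made up for illustration.

```python
COMMON = {
    "namespace": "recordData", "type": "record", "name": "CommonData",
    "aliases": ["CustomizedRecordDataFoo", "CustomizedRecordDataBar"],
    "fields": [
        {"name": "recordId", "type": ["null", "string"], "default": None},
        {"name": "recordDate", "type": ["null", "string"], "default": None},
        {"name": "recordPrice", "type": ["null", "int"], "default": None},
        {"name": "customer", "type": ["null", "string"], "default": None},
    ],
}

FOO = {
    "namespace": "recordData", "type": "record",
    "name": "CustomizedRecordDataFoo",
    "fields": [
        {"name": "recordId", "type": "string"},
        {"name": "foo", "type": ["string", "null"]},
    ],
}

BAR = {
    "namespace": "recordData", "type": "record",
    "name": "CustomizedRecordDataBar",
    "fields": [
        {"name": "bar", "type": "string"},
        {"name": "recordDate", "type": "string"},
    ],
}


def resolve_record(writer_schema, reader_schema, datum):
    """Project a datum written with writer_schema onto reader_schema."""
    accepted = {reader_schema["name"], *reader_schema.get("aliases", [])}
    if writer_schema["name"] not in accepted:
        raise ValueError("record names do not match, even via aliases")
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = datum[name]       # present in the writer's data
        elif "default" in field:
            result[name] = field["default"]  # absent: use the reader default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer-only fields (foo, bar) are simply dropped.
    return result


from_foo = resolve_record(FOO, COMMON, {"recordId": "r-1", "foo": "x"})
from_bar = resolve_record(BAR, COMMON, {"bar": "b", "recordDate": "2014-03-19"})
print(from_foo)
print(from_bar)
```

With the real library you would pass both the writer schema and this reader schema to the datum reader and let it do the resolution for you.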

Issues to consider in this approach

1) You have to make sure the schemas of the individual fields resolve
according to the spec rules[2]. The simplified version of this is to make
sure both sides use the same type (string, int, or whatever), with the one
in CommonData nullable.

2) If the field in the customized record is nullable, you won't be able to
tell the difference between the field not being present and being null. You
can mitigate this by using a known placeholder default instead.
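
The second issue can be sketched the same way, again in plain Python rather than the Avro library. It assumes, hypothetically, that CommonData also carried the nullable "foo" field; the "__absent__" sentinel and the read_field helper are made up for illustration. Note that the default of a union field has to match the first branch of the union, so a string placeholder needs "string" listed first.

```python
ABSENT = "__absent__"  # hypothetical sentinel; pick one your data never uses


def read_field(writer_field_names, reader_field, datum):
    """Return the written value, or the reader default if the writer
    schema has no such field (simplified model of schema resolution)."""
    name = reader_field["name"]
    if name in writer_field_names:
        return datum[name]
    return reader_field["default"]


# Two candidate definitions of the same reader field.
null_default = {"name": "foo", "type": ["null", "string"], "default": None}
placeholder_default = {"name": "foo", "type": ["string", "null"],
                       "default": ABSENT}

foo_writer_fields = {"recordId", "foo"}    # CustomizedRecordDataFoo
bar_writer_fields = {"bar", "recordDate"}  # CustomizedRecordDataBar

foo_datum = {"recordId": "r-1", "foo": None}          # foo written as null
bar_datum = {"bar": "b", "recordDate": "2014-03-19"}  # foo not in the schema

# With a null default, "written as null" and "not present" look the same.
ambiguous = (read_field(foo_writer_fields, null_default, foo_datum),
             read_field(bar_writer_fields, null_default, bar_datum))

# With a placeholder default, the two cases can be told apart.
distinct = (read_field(foo_writer_fields, placeholder_default, foo_datum),
            read_field(bar_writer_fields, placeholder_default, bar_datum))

print(ambiguous)  # (None, None)
print(distinct)   # (None, '__absent__')
```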

If you can stand some storage overhead, you can deal with the first issue
by using the all-nullable CommonData record in all of the customized
records and then only setting those fields you actually want used.

-Sean

[1]: http://avro.apache.org/docs/1.7.6/spec.html#Names
[2]: http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution