Avro, mail # user - references to other schemas

Jay Kreps 2010-05-02, 19:18
Scott Carey 2010-05-02, 20:45
Doug Cutting 2010-05-03, 17:03
Jeff Hodges 2010-05-03, 17:37
Scott Carey 2010-05-03, 17:48
Re: references to other schemas
Jay Kreps 2010-05-03, 18:47
Yes, agreed you need a full schema for the resulting message with no
external references. My proposal is just pre-processor support that
does the expansion based on unresolved names as you describe. I think
this is better than explicit includes or URLs directly in the schemas
(after all the fully qualified name is the name system used to refer
to types not a URL). I think you need this both for specific and
generic for it to be useful (it shouldn't just work for one--you
should be able to load the fragments, resolve them and use them as
schemas normally without the compiler). Without actually thinking it
all through, the idea would be to introduce

class Schemas {
   public List<Schema> parse(String... jsons) {...}
}

Plus of course parse methods with File, InputStream, etc. This would
need to resolve inter-referencing schemas and expand them based on
type references. The inputs would be fragments and the schemas you get
back will be fully resolved. Once you have the full Schema it would be
no different than if you had manually expanded the whole thing.
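
A minimal sketch of the resolution step, assuming each fragment has
already been reduced to its fully qualified name plus the names it
references. FragmentResolver and loadOrder are hypothetical names used
only for illustration; they are not part of Avro. The sketch works out
an order in which fragments could be expanded so that every referenced
type is defined before it is used.

```java
import java.util.*;

public class FragmentResolver {

    /**
     * Returns fragment names ordered so that each fragment comes after
     * every fragment it references. Keys are fully qualified type names,
     * values are the names that fragment refers to.
     */
    public static List<String> loadOrder(Map<String, List<String>> refs) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        // Sort the starting names so the result is deterministic.
        for (String name : new TreeSet<>(refs.keySet())) {
            visit(name, refs, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String name, Map<String, List<String>> refs,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(name)) {
            return;
        }
        if (!refs.containsKey(name)) {
            throw new IllegalArgumentException("Unresolved type reference: " + name);
        }
        // Assumption: cycles are rejected to keep the sketch short, although
        // Avro itself permits recursive record definitions.
        if (!inProgress.add(name)) {
            throw new IllegalArgumentException("Circular reference involving: " + name);
        }
        for (String dep : refs.get(name)) {
            visit(dep, refs, done, inProgress, order);
        }
        inProgress.remove(name);
        done.add(name);
        order.add(name);
    }
}
```

A real implementation would then substitute each definition into the
schemas that name it, so that what comes back has no external
references.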

The specific compiler would then be changed to use this class to load
schemas and the arguments would be changed from
   input output_dir
to
   input1 input2 ... inputN output_dir

It would probably make sense to support directories as well as files for inputs.
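
As a rough sketch of what accepting directories could look like,
assuming schema files use the .avsc extension: InputCollector and
collect are hypothetical names, not existing compiler code. Directory
arguments are walked recursively and plain file arguments are passed
through unchanged.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class InputCollector {

    /** Expands a mix of file and directory arguments into a list of schema files. */
    public static List<Path> collect(List<Path> inputs) throws IOException {
        List<Path> result = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                // Walk the directory tree and keep only regular *.avsc files.
                try (Stream<Path> walk = Files.walk(input)) {
                    walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".avsc"))
                        .sorted()
                        .forEach(result::add);
                }
            } else {
                // Explicitly named files are taken as given.
                result.add(input);
            }
        }
        return result;
    }
}
```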

The use case this addresses is the common case of having shared
headers, fields, or other includes that get used in a standard way
across a large number of messages.

It is worth thinking this proposal through, since for an organization
that needs to maintain a large set of messages, how they interconnect
and what dependencies there are is quite critical.


On Mon, May 3, 2010 at 10:48 AM, Scott Carey <[EMAIL PROTECTED]> wrote:
> On May 3, 2010, at 10:03 AM, Doug Cutting wrote:
>> Scott Carey wrote:
>>> There has been talk that AvroGen would handle features like this (as well as many others) in time.  However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
>> Note that JSON schemas and protocols need to be standalone, containing
>> the full lexical closure of schemas referenced, when they are included
>> in data files and exchanged in RPC handshakes without reference to
>> external data.  Thus I am reluctant to add a JSON syntax for file
>> inclusion.  Rather, I think a pre-processor is appropriate.  The
>> pre-processor would not be run on schemas included in files or exchanged
>> in RPC handshakes, but would be run for schemas read from files.
> Exactly.  I don't think we should change the JSON syntax by adding references or includes.
> We should just make the SpecificCompiler capable of reading a collection of files and figuring out how to compile them when there is not full lexical closure in a .avsc file.
> File formats and RPCs have much stricter requirements than the SpecificCompiler.
>> I have experimented with using the m4 pre-processor for this purpose,
>> and found it a bit awkward.  Perhaps someone can develop macros for m4
>> that make it palatable, or perhaps we can develop a custom pre-processor
>> for JSON.
>> We might exploit otherwise-illegal JSON syntax, like backquotes, for
>> pre-processor directives.  An include might look something like:
>> {"protocol": "org.foo.BarProtocol",
>>  "types": [
>>    `include org.foo.Bar`,
>>     ...
>>   ]
>> }
> Rather than use a preprocessor, is it possible to have the SpecificCompiler search the other files in the set for types that can't be found in the current file?  The result will be SpecificRecord objects that have their $SCHEMA field populated with a schema that has full lexical closure.
> Essentially, if given two files:
> IpTypes.avsc --
> [{"name": "com.somewhere.avro.IPV4", "type": "fixed", "size":4},
> {"name": "com.somewhere.avro.IPV6", "type": "fixed", "size":16}]
> MyRecord.avsc --
> {"name": "com.somewhere.avro.MyRecord", "type": "record", "fields": [