-
references to other schemas
Jay Kreps 2010-05-02, 19:18
I want to have a shared type schema which would be used by 50 or so messages (say a type Header defined in a single place that all messages would use), and I can't seem to find a way to do this (though I may just have missed it).
This could be done by an "import" statement in the .avsc file, as Protocol Buffers does, but I do not think that really makes sense in a world of non-statically compiled schemas. Probably a better way is just to make a type "Xyz" resolve to the schema of that type. The remaining work would be to open up the methods shown below and make the SpecificCompiler take many files, resolve all the inter-references, and generate a set of classes instead of a single file. The resulting schema would have no reference to Xyz, but rather would directly include the schema for Xyz in its place.
This looks like it can *almost* be done using some internal private methods:
/* this package protected method parses wrt the given names. Header could be given here if I understand correctly */
Schema.parse(JsonNode schema, Names names)
/* compile multiple schemas into multiple files */
s = SpecificCompiler()
s.enqueue(header)
s.enqueue(schemaUsingHeader)
outputFiles = s.compile()
Is this kind of thing handled in some other way that I have just missed? If not, is there any objection to a patch that opens up these methods and adds options to SpecificCompiler to jointly compile a bunch of files all at once? Perhaps this is already in flight?
-Jay
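The expansion described in this message can be sketched without any Avro classes. The following is a minimal illustration only, not Avro's API: the class InlineRefs and its expand method are hypothetical, a schema is modeled as a String (a type name), a List (a union), or a Map (a definition or a field), and names are assumed to match exactly, with no namespace handling. Because Avro allows a given name to be defined only once within a schema, the sketch inlines a definition at its first use and leaves later uses as plain names.

```java
import java.util.*;

// Hypothetical sketch, not Avro's API. A schema is a String (type name),
// a List (union), or a Map (definition or field), as a JSON parser would yield.
class InlineRefs {
    static final Set<String> KEYWORDS = Set.of("null", "boolean", "int", "long", "float",
            "double", "bytes", "string", "array", "map", "record", "enum", "fixed");
    static final Set<String> NAMED = Set.of("record", "enum", "fixed");
    // Keys whose values are themselves schemas (or lists of fields).
    static final Set<String> SCHEMA_SLOTS = Set.of("type", "items", "values", "fields");

    // Returns a copy of schema in which the first use of each name found in
    // known is replaced by its definition; later uses stay as names.
    static Object expand(Object schema, Map<String, Object> known, Set<String> defined) {
        if (schema instanceof String) {
            String name = (String) schema;
            if (KEYWORDS.contains(name) || defined.contains(name)) return name;
            Object def = known.get(name);
            if (def == null) throw new IllegalArgumentException("unresolved type: " + name);
            return expand(def, known, defined);
        }
        if (schema instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) schema) out.add(expand(item, known, defined));
            return out;
        }
        Map<?, ?> in = (Map<?, ?>) schema;
        Object type = in.get("type");
        // Record the name before descending so recursive references stay as names.
        if (type instanceof String && NAMED.contains(type)) defined.add((String) in.get("name"));
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : in.entrySet()) {
            String key = (String) e.getKey();
            out.put(key, SCHEMA_SLOTS.contains(key) ? expand(e.getValue(), known, defined) : e.getValue());
        }
        return out;
    }
}
```

Calling expand on a message schema with a map containing the shared Header definition yields a standalone structure that no longer depends on the other file.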
-
Re: references to other schemas
Scott Carey 2010-05-02, 20:45
On May 2, 2010, at 12:18 PM, Jay Kreps wrote:
> I want to have a shared type schema which would be used by 50 or so
> messages (say a type Header defined in a single place that all
> messages would use), and I can't seem to find a way to do this (though
> I may just have missed it).
>
> This could be done either by an "import" statement in the .avsc file
> as protocol buffers does, but I do not think that really makes sense
> in a world of non-statically compiled schemas. Probably a better way
> is just to make a type "Xyz" resolve to the schema of that type. Then
> just to open up these methods, and make the SpecificCompiler take lots
> of files, resolve all the inter-references, and then generate a bunch
> of classes instead of a single file. The resulting schema would have
> no reference to Xyz, but rather would directly include the schema for
> Xyz in its place.
>
> This looks like it can *almost* be done using some internal private methods:
>
> /* this package protected method parses wrt the given names. Header
> could be given here if I understand correctly */
> Schema.parse(JsonNode schema, Names names)
>
> /* compile multiple schemas into multiple files*/
> s = SpecificCompiler()
> s.enqueue(header)
> s.enqueue(schemaUsingHeader)
> outputFiles = s.compile()
>
> Is this kind of thing handled in some other way I have just missed? If
> not any objection to a patch that opens up these methods and adds
> options to SpecificCompiler to jointly compile a bunch of files all at
> once? Perhaps this is already in flight?
>
It is not in flight to my knowledge, and it would certainly make the SpecificCompiler easier to use. I would welcome such a contribution.
Being able to compile a collection of *.avsc and *.avpr files and resolve types across them would be great.
There has been talk that AvroGen would handle features like this (as well as many others) in time. However, this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
-Scott
> -Jay
-
Re: references to other schemas
Doug Cutting 2010-05-03, 17:03
Scott Carey wrote:
> There has been talk that AvroGen would handle features like this (as well as many others) in time. However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
Note that JSON schemas and protocols need to be standalone, containing the full lexical closure of schemas referenced, when they are included in data files and exchanged in RPC handshakes without reference to external data. Thus I am reluctant to add a JSON syntax for file inclusion. Rather, I think a pre-processor is appropriate. The pre-processor would not be run on schemas included in files or exchanged in RPC handshakes, but would be run for schemas read from files.
I have experimented with using the m4 pre-processor for this purpose, and found it a bit awkward. Perhaps someone can develop macros for m4 that make it palatable, or perhaps we can develop a custom pre-processor for JSON.
We might exploit otherwise-illegal JSON syntax, like backquotes, for pre-processor directives. An include might look something like:
{"protocol": "org.foo.BarProtocol",
 "types": [
   `include org.foo.Bar`,
   ...
 ]
}
Also note that a protocol file (.avpr) need not actually define any messages but can be used to define a set of types that reference one another. This is a stopgap, but a useful one.
Doug
-
Re: references to other schemas
Jeff Hodges 2010-05-03, 17:37
Backticks are allowed inside of strings, though, so whatever preprocessor was used would have to have some understanding of JSON. That narrows the choice of preprocessors.
I'm fairly neutral on the idea of composite schemas, overall. The biggest problem I have is that JSON has no standard way of referring to URLs (in the HTML5 sense) and they seem to be the best way to do this.
On schema read, the references could be loaded once and kept that way in order to have a complete schema on RPC and datafile write. Basically, we would say references will be used on read, but not on write.
-- Jeff
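The point about backticks inside strings can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class IncludePreprocessor is hypothetical, the directive syntax is the one floated in this thread, and the table mapping a type name to its JSON text stands in for whatever lookup a real tool would use. The scanner tracks whether it is inside a JSON string (honoring backslash escapes), so a backtick in a string value is copied through untouched and only a directive outside a string is expanded. Included text is not rescanned.

```java
import java.util.Map;

// Hypothetical sketch of a JSON-aware pre-processor; not an existing tool.
class IncludePreprocessor {
    // Replaces `include NAME` directives that appear outside JSON strings
    // with the text registered for NAME in sources.
    static String expand(String text, Map<String, String> sources) {
        StringBuilder out = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < text.length()) {
                    out.append(text.charAt(++i)); // copy the escaped character as-is
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (c == '`') {
                int end = text.indexOf('`', i + 1);
                if (end < 0) throw new IllegalArgumentException("unterminated directive");
                String directive = text.substring(i + 1, end).trim();
                if (!directive.startsWith("include ")) {
                    throw new IllegalArgumentException("unknown directive: " + directive);
                }
                String name = directive.substring("include ".length()).trim();
                String body = sources.get(name);
                if (body == null) throw new IllegalArgumentException("no source for " + name);
                out.append(body);
                i = end; // skip past the closing backtick
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A general-purpose macro processor has no notion of JSON string boundaries, which is why this kind of state tracking would have to be built into whatever tool is chosen.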
On May 3, 2010 10:03 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
Scott Carey wrote: > > There has been talk that AvroGen would handle features like this (as well as ...
Note that JSON schemas and protocols need to be standalone, containing the full lexical closure of schemas referenced, when they are included in data files and exchanged in RPC handshakes without reference to external data. Thus I am reluctant to add a JSON syntax for file inclusion. Rather, I think a pre-processor is appropriate. The pre-processor would not be run on schemas included in files or exchanged in RPC handshakes, but would be run for schemas read from files.
I have experimented with using the m4 pre-processor for this purpose, and found it a bit awkward. Perhaps someone can develop macros for m4 that make it palatable, or perhaps we can develop a custom pre-processor for JSON.
We might exploit otherwise-illegal JSON syntax, like backquotes, for pre-processor directives. An include might look something like:
{"protocol": "org.foo.BarProtocol",
 "types": [
   `include org.foo.Bar`,
   ...
 ]
}
Also note that a protocol file (.avpr) need not actually define any messages but can be used to define a set of types that reference one another. This is a stopgap, but a useful one.
Doug
-
Re: references to other schemas
Scott Carey 2010-05-03, 17:48
On May 3, 2010, at 10:03 AM, Doug Cutting wrote:
> Scott Carey wrote:
>> There has been talk that AvroGen would handle features like this (as well as many others) in time. However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
>
> Note that JSON schemas and protocols need to be standalone, containing
> the full lexical closure of schemas referenced, when they are included
> in data files and exchanged in RPC handshakes without reference to
> external data. Thus I am reluctant to add a JSON syntax for file
> inclusion. Rather, I think a pre-processor is appropriate. The
> pre-processor would not be run on schemas included in files or exchanged
> in RPC handshakes, but would be run for schemas read from files.
Exactly. I don't think we should change the JSON syntax by adding references or includes.
We should just make the SpecificCompiler capable of reading a collection of files and figuring out how to compile them when a .avsc file lacks full lexical closure. File formats and RPCs have much stricter requirements than the SpecificCompiler.
>
> I have experimented with using the m4 pre-processor for this purpose,
> and found it a bit awkward. Perhaps someone can develop macros for m4
> that make it palatable, or perhaps we can develop a custom pre-processor
> for JSON.
>
> We might exploit otherwise-illegal JSON syntax, like backquotes, for
> pre-processor directives. An include might look something like:
>
> {"protocol": "org.foo.BarProtocol",
>  "types": [
>    `include org.foo.Bar`,
>    ...
>  ]
> }
>
Rather than use a preprocessor, is it possible to have the SpecificCompiler search the other files in the set for types that can't be found in the current file? The result would be SpecificRecord objects that have their $SCHEMA field populated with a schema that has full lexical closure.
Essentially, if given two files:
IpTypes.avsc --
[{"name": "com.somewhere.avro.IPV4", "type": "fixed", "size": 4},
 {"name": "com.somewhere.avro.IPV6", "type": "fixed", "size": 16}]
MyRecord.avsc --
{"name": "com.somewhere.avro.MyRecord", "type": "record", "fields": [
  {"name": "hostname", "type": "string"},
  {"name": "IP", "type": [ "IPV4", "IPV6" ]}
]}
The SpecificCompiler could compile MyRecord.avsc if concurrently given IpTypes.avsc to resolve the unknown references "IPV4" and "IPV6". Perhaps it could also compile if it is aware of a SpecificRecord Java class that has an appropriate schema. Doing this with a preprocessor would be tricky, especially in a namespace-appropriate way, and a preprocessor could not support integration with existing SpecificRecord classes.
Perhaps IPV4 and IPV6 are already compiled SpecificRecord classes in jar "CommonTypes.jar" -- SpecificCompiler could run with those in its classpath and a directive to look for valid types in its classpath in addition to the files.
The MyRecord.avsc file above does not contain a fully valid Avro schema, so perhaps we could denote this with a different file extension.
> Also note that a protocol file (.avpr) need not actually define any
> messages but can be used to define a set of types that reference one
> another. This is a stopgap, but a useful one.
>
> Doug
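The cross-file lookup proposed in this message, including the "namespace-appropriate" part, can be sketched as follows. This is an illustration only: the class CrossFileLookup and both of its methods are hypothetical, and the per-file index of defined names is assumed to have been built by scanning every input file beforehand. The name rule used is Avro's: a dotted name is already a full name, while a bare name such as "IPV4" takes the namespace of the enclosing definition, here the one implied by "com.somewhere.avro.MyRecord".

```java
import java.util.*;

// Hypothetical sketch, not part of Avro: finds which input file defines each
// type that a schema references but does not define itself.
class CrossFileLookup {
    static final Set<String> BUILT_IN = Set.of("null", "boolean", "int", "long",
            "float", "double", "bytes", "string");

    // A dotted name is already a full name; a bare name is qualified with
    // the namespace of the enclosing definition.
    static String fullName(String name, String namespace) {
        return (name.contains(".") || namespace.isEmpty()) ? name : namespace + "." + name;
    }

    // Maps the full name of every non-built-in reference to the file defining it.
    static Map<String, String> providers(List<String> references, String namespace,
                                         Map<String, Set<String>> definedByFile) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String ref : references) {
            if (BUILT_IN.contains(ref)) continue; // primitive names have no namespace
            String full = fullName(ref, namespace);
            String file = null;
            for (Map.Entry<String, Set<String>> e : definedByFile.entrySet()) {
                if (e.getValue().contains(full)) {
                    file = e.getKey();
                    break;
                }
            }
            if (file == null) throw new IllegalArgumentException("unresolved type: " + full);
            result.put(full, file);
        }
        return result;
    }
}
```

With the two files from the example, the references "IPV4" and "IPV6" in MyRecord.avsc both resolve to IpTypes.avsc. The same index could in principle be extended with names found on the classpath, as suggested above, but that is not shown here.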
-
Re: references to other schemas
Jay Kreps 2010-05-03, 18:47
Yes, agreed: you need a full schema for the resulting message, with no external references. My proposal is just pre-processor support that does the expansion based on unresolved names, as you describe. I think this is better than explicit includes or URLs directly in the schemas (after all, the fully qualified name, not a URL, is the naming system used to refer to types). I think you need this for both specific and generic for it to be useful; it shouldn't work for just one, and you should be able to load the fragments, resolve them, and use them as schemas normally without the compiler. Without having thought it all through, the idea would be to introduce
class Schemas {
  public List<Schema> parse(String... jsons) {...}
}
Plus, of course, parse methods taking File, InputStream, etc. This would need to resolve inter-referencing schemas and expand them based on type references. The inputs would be fragments, and the schemas you get back would be fully resolved. Once you have the full Schema, it would be no different than if you had manually expanded the whole thing.
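One way to make such a parse method independent of the order in which fragments are passed is to sort them so that each fragment is handled after the ones it references. The sketch below shows only that ordering step and is not part of Avro: the class FragmentOrder is hypothetical, and its input (each type name mapped to the names it references) is assumed to have been extracted from the fragments already. As a simplification, a reference cycle across fragments is reported as an error, although recursive types are legal in Avro.

```java
import java.util.*;

// Hypothetical sketch: orders schema fragments so that every fragment comes
// after the fragments it references (depth-first topological sort).
class FragmentOrder {
    // refs maps each type name to the list of type names it references.
    static List<String> order(Map<String, List<String>> refs) {
        List<String> done = new ArrayList<>();
        Set<String> visiting = new HashSet<>();
        for (String name : refs.keySet()) visit(name, refs, visiting, done);
        return done;
    }

    private static void visit(String name, Map<String, List<String>> refs,
                              Set<String> visiting, List<String> done) {
        if (done.contains(name)) return; // already placed in the output
        if (!refs.containsKey(name)) throw new IllegalArgumentException("unresolved type: " + name);
        if (!visiting.add(name)) throw new IllegalArgumentException("cyclic reference through: " + name);
        for (String dep : refs.get(name)) visit(dep, refs, visiting, done);
        visiting.remove(name);
        done.add(name);
    }
}
```

Parsing the fragments in the returned order means every name is known by the time it is referenced, so each resulting schema can be fully expanded.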
The specific compiler would then be changed to use this class to load schemas, and its arguments would change from "input output_dir" to "input1 input2 ... inputN output_dir".
It would probably make sense to support directories as well as files for inputs.
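The proposed argument handling can be sketched with the standard library alone. This is a hypothetical illustration, not the SpecificCompiler's actual command line: the class CompilerArgs and its methods are invented for the example, and filtering directories by the .avsc and .avpr extensions is an assumption based on the file types mentioned earlier in the thread. All arguments but the last are inputs, the last is the output directory, and any input that is a directory is searched recursively.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

// Hypothetical sketch of "input1 input2 ... inputN output_dir" handling.
class CompilerArgs {
    // Returns the input files; directories are walked for schema files.
    static List<Path> inputs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: input1 [input2 ...] output_dir");
        }
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < args.length - 1; i++) {
            Path p = Paths.get(args[i]);
            if (!Files.isDirectory(p)) {
                files.add(p); // plain file argument, taken as given
                continue;
            }
            try (Stream<Path> walk = Files.walk(p)) {
                walk.filter(Files::isRegularFile)
                    .filter(CompilerArgs::isSchemaFile)
                    .sorted()
                    .forEach(files::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return files;
    }

    // Assumed convention: only .avsc and .avpr files are picked up from directories.
    static boolean isSchemaFile(Path f) {
        String name = f.toString();
        return name.endsWith(".avsc") || name.endsWith(".avpr");
    }

    // The last argument names the output directory.
    static Path outputDir(String[] args) {
        return Paths.get(args[args.length - 1]);
    }
}
```

Sorting the files found in each directory keeps the result deterministic, which matters less if the fragments are later ordered by their references anyway.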
The use case this addresses is the common case of having shared headers, fields, or other includes that get used in a standard way across a large number of messages.
It is worth thinking this proposal through, since for an organization that needs to maintain a large set of messages, how they interconnect and what dependencies there are is quite critical.
-Jay
On Mon, May 3, 2010 at 10:48 AM, Scott Carey <[EMAIL PROTECTED]> wrote:
>
> On May 3, 2010, at 10:03 AM, Doug Cutting wrote:
>
>> Scott Carey wrote:
>>> There has been talk that AvroGen would handle features like this (as well as many others) in time. However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
>>
>> Note that JSON schemas and protocols need to be standalone, containing
>> the full lexical closure of schemas referenced, when they are included
>> in data files and exchanged in RPC handshakes without reference to
>> external data. Thus I am reluctant to add a JSON syntax for file
>> inclusion. Rather, I think a pre-processor is appropriate. The
>> pre-processor would not be run on schemas included in files or exchanged
>> in RPC handshakes, but would be run for schemas read from files.
>
> Exactly. I don't think we shouldn't change the JSON syntax by adding references or includes.
>
> We should just make the SpecificCompiler capable of reading a collection of files and figuring out how to compile them when there is not full lexical closure in a .avsc file.
> File formats and RPC's have much stricter requirements than the SpecificCompiler.
>
>>
>> I have experimented with using the m4 pre-processor for this purpose,
>> and found it a bit awkward. Perhaps someone can develop macros for m4
>> that make it palatable, or perhaps we can develop a custom pre-processor
>> for JSON.
>>
>> We might exploit otherwise-illegal JSON syntax, like backquotes, for
>> pre-processor directives. An include might look something like:
>>
>> {"protocol": "org.foo.BarProtocol",
>>  "types": [
>>    `include org.foo.Bar`,
>>    ...
>>  ]
>> }
>>
>
> Rather than use a preprocessor, Is it possible to have the SpecificCompiler search the other files in the set for types that can't be found in the current file?
> The result will be SpecificRecord objects that have their $SCHEMA field populated with a schema that has full lexical closure.
>
> Essentially, if given two files:
> IpTypes.avsc --
>
> [{"name": "com.somewhere.avro.IPV4", "type": "fixed", "size":4},
> {"name": "com.somewhere.avro.IPV6", "type": "fixed", "size":16}]
>
> MyRecord.avsc --
>
> {"name": "com.somewhere.avro.MyRecord", "type": "record", "fields": [