|
Bill de hOra
2010-06-07, 20:53
Scott Carey
2010-06-07, 21:33
Doug Cutting
2010-06-07, 21:44
Bill de hOra
2010-06-07, 22:11
Philip Zeyliger
2010-06-07, 22:13
Bill de hOra
2010-06-07, 22:27
Doug Cutting
2010-06-07, 22:30
Scott Carey
2010-06-07, 22:34
Scott Carey
2010-06-07, 22:40
Doug Cutting
2010-06-07, 22:51
Philip Zeyliger
2010-06-07, 22:53
Jeff Hammerbacher
2010-06-07, 23:03
Bill de hOra
2010-06-08, 00:04
Scott Carey
2010-06-10, 16:27
Doug Cutting
2010-06-10, 18:49
Scott Carey
2010-06-10, 21:09
Scott Carey
2010-06-10, 21:13
Doug Cutting
2010-06-10, 21:57
Doug Cutting
2010-06-10, 22:04
Scott Carey
2010-06-10, 22:44
Doug Cutting
2010-06-10, 23:40
Eric Evans
2010-09-04, 02:09
Doug Cutting
2010-09-07, 16:38
|
-
schema defaults not reflected in generated objects (1.3.2)Bill de hOra 2010-06-07, 20:53
Hi,
I.m wondering if I'm missing an incantation for 'compile schema/protocol' in Avro tools to handle defaults. Given this schema, [[[ { "name": "avro.Message", "type": "record", "fields": [ { "name": "foo", "type": "string", "default":"quux" }, { "name": "bar", "type": "string" } ] } ]]] and running it through 'avro-tools-1.3.2.jar compile schema', I get this object [[[ package avro; @SuppressWarnings("all") public class Message extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord { public static final org.apache.avro.Schema SCHEMA$ = org.apache.avro.Schema.parse("{\"type\":\"record\",\"name\":\"Message\",\"namespace\":\"avro\",\"fields\":[{\"name\":\"foo\",\"type\":\"string\",\"default\":\"quux\"},{\"name\":\"bar\",\"type\":\"string\"}]}"); public org.apache.avro.util.Utf8 foo = new Utf8("quux"); public org.apache.avro.util.Utf8 bar; public org.apache.avro.Schema getSchema() { return SCHEMA$; } public java.lang.Object get(int field$) { switch (field$) { case 0: return foo; case 1: return bar; default: throw new org.apache.avro.AvroRuntimeException("Bad index"); } } @SuppressWarnings(value="unchecked") public void put(int field$, java.lang.Object value$) { switch (field$) { case 0: foo = (org.apache.avro.util.Utf8)value$; break; case 1: bar = (org.apache.avro.util.Utf8)value$; break; default: throw new org.apache.avro.AvroRuntimeException("Bad index"); } } } ]]] Should this line, [[[ public org.apache.avro.util.Utf8 foo; ]]] be, [[[ public org.apache.avro.util.Utf8 foo = new org.apache.avro.util.Utf8("quux"); ]]] ? Fwiw, changing the object's 'foo' field above to have the default stops the IPC server from barfing with an NPE (@ org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)) when a client creates a message object only sets 'bar', and then sends the Message to the server via . Bill
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-07, 21:33
No, it should not initialize the field to the default.
Default values are for readers, not writers. The intended use case is schema evolution. A reader on version N+1 has field foo with default 'quux', but reads data that was written without that field. The reader will see foo = 'quux'. A writer must always correctly provide data for all of the fields in the schema it declared it is writing. The specific API does make this a bit more difficult than it needs to be, it is something I would like to address at a later time. In the meantime I use wrapper classes around the SpecificRecord objects that provide restricted getters and restrictive constructors or factory methods to prevent other users from forgetting to set a field. -Scott On Jun 7, 2010, at 1:53 PM, Bill de hOra wrote: > Hi, > > I.m wondering if I'm missing an incantation for 'compile > schema/protocol' in Avro tools to handle defaults. Given this schema, > > [[[ > { > "name": "avro.Message", > "type": "record", > "fields": [ > { > "name": "foo", > "type": "string", > "default":"quux" > }, > { > "name": "bar", > "type": "string" > } > ] > > } > ]]] > > and running it through 'avro-tools-1.3.2.jar compile schema', I get this > object > > [[[ > package avro; > > @SuppressWarnings("all") > public class Message extends org.apache.avro.specific.SpecificRecordBase > implements org.apache.avro.specific.SpecificRecord { > public static final org.apache.avro.Schema SCHEMA$ = > org.apache.avro.Schema.parse("{\"type\":\"record\",\"name\":\"Message\",\"namespace\":\"avro\",\"fields\":[{\"name\":\"foo\",\"type\":\"string\",\"default\":\"quux\"},{\"name\":\"bar\",\"type\":\"string\"}]}"); > public org.apache.avro.util.Utf8 foo = new Utf8("quux"); > public org.apache.avro.util.Utf8 bar; > public org.apache.avro.Schema getSchema() { return SCHEMA$; } > public java.lang.Object get(int field$) { > switch (field$) { > case 0: return foo; > case 1: return bar; > default: throw new org.apache.avro.AvroRuntimeException("Bad index"); > } > } > @SuppressWarnings(value="unchecked") > public void put(int field$, java.lang.Object value$) { > switch (field$) { > case 0: foo = (org.apache.avro.util.Utf8)value$; break; > case 1: bar = (org.apache.avro.util.Utf8)value$; break; > default: throw new org.apache.avro.AvroRuntimeException("Bad index"); > } > } > } > ]]] > > Should this line, > > [[[ > public org.apache.avro.util.Utf8 foo; > ]]] > > be, > > [[[ > public org.apache.avro.util.Utf8 foo > = new org.apache.avro.util.Utf8("quux"); > ]]] > > ? > > Fwiw, changing the object's 'foo' field above to have the default stops > the IPC server from barfing with an NPE (@ > org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)) > when a client creates a message object only sets 'bar', and then sends > the Message to the server via . > > Bill > > >
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-07, 21:44
On 06/07/2010 01:53 PM, Bill de hOra wrote:
> I.m wondering if I'm missing an incantation for 'compile > schema/protocol' in Avro tools to handle defaults. Given this schema, In Avro, default values are used by a reader when a writer's schema lacks a field. This permits one to add a new field and still process data written before that field was added, provided one specifies a default value for the new field. Similarly, if one removes a field, older code may still process the data, provided it has a default value. This is related but not the same as the default value for a Java instance. When one manually specifies a default value for a field in Java, one does not do this not primarily in order to future-proof the class, but typically so that one can avoid having to explicitly set the value somewhere else. What both concepts have in common is that they permit a developer to provide a reasonable value to be used when no more specific value is known. So it might be a useful feature if Java instance default values were acquired from the schema. For one thing, folks seem to expect it. But, from Avro's perspective, I don't think it's a bug. Does that make sense? Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Bill de hOra 2010-06-07, 22:11
Scott Carey wrote:
> No, it should not initialize the field to the default. > > Default values are for readers, not writers. The intended use case is schema evolution. This means writers can't leverage schema defaults, so writers should do something like this? Message message = new Message(); // no defaults set String quux = message .getSchema() .getField("foo") .defaultValue() .getTextValue(); message.foo=new Utf8(quux); [ignoring that the writer needs to know the schema type]. I suspect people will just write in garbage (like empty strings). > A writer must always correctly provide > data for all of the fields in the schema it declared > it is writing. Why is it incorrect to not provide defaults when defaults are part of the schema author's intention? Or put another way, why is reader/writer asymmetry a goal under a given schema? I see in the code that SpecificDatumWriter/GenericDatumWriter is passed the Schema - By all means crash on fields with no defaults, but I'm not clear on what harm is done by using default field data. The current code seems fragile in comparison. Bill
-
Re: schema defaults not reflected in generated objects (1.3.2)Philip Zeyliger 2010-06-07, 22:13
We could crash on serialization if the fields haven't been explicitly
set, but I'm +1 for the default values to be represented at generated object construction time. -- Philip On Mon, Jun 7, 2010 at 3:11 PM, Bill de hOra <[EMAIL PROTECTED]> wrote: > Scott Carey wrote: >> >> No, it should not initialize the field to the default. >> >> Default values are for readers, not writers. The intended use case is >> schema evolution. > > This means writers can't leverage schema defaults, so writers should do > something like this? > > Message message = new Message(); > // no defaults set > String quux = message > .getSchema() > .getField("foo") > .defaultValue() > .getTextValue(); > message.foo=new Utf8(quux); > > [ignoring that the writer needs to know the schema type]. I suspect people > will just write in garbage (like empty strings). > > >> A writer must always correctly provide >> data for all of the fields in the schema it declared >> it is writing. > > Why is it incorrect to not provide defaults when defaults are part of the > schema author's intention? Or put another way, why is reader/writer > asymmetry a goal under a given schema? > > I see in the code that SpecificDatumWriter/GenericDatumWriter is passed the > Schema - By all means crash on fields with no defaults, but I'm not clear on > what harm is done by using default field data. The current code seems > fragile in comparison. > > Bill > >
-
Re: schema defaults not reflected in generated objects (1.3.2)Bill de hOra 2010-06-07, 22:27
Doug Cutting wrote:
> On 06/07/2010 01:53 PM, Bill de hOra wrote: >> I.m wondering if I'm missing an incantation for 'compile >> schema/protocol' in Avro tools to handle defaults. Given this schema, > > In Avro, default values are used by a reader when a writer's schema > lacks a field. This permits one to add a new field and still process > data written before that field was added, provided one specifies a > default value for the new field. Similarly, if one removes a field, > older code may still process the data, provided it has a default value. > > This is related but not the same as the default value for a Java > instance. When one manually specifies a default value for a field in > Java, one does not do this not primarily in order to future-proof the > class, but typically so that one can avoid having to explicitly set the > value somewhere else. What both concepts have in common is that they > permit a developer to provide a reasonable value to be used when no more > specific value is known. > > So it might be a useful feature if Java instance default values were > acquired from the schema. For one thing, folks seem to expect it. But, > from Avro's perspective, I don't think it's a bug. > > Does that make sense? It does if Avro == specification (my reply to Scott is quizzing the Java generator code btw, not the spec). On that ground, I can't argue that the asymmetry is a bug, but it might turn out to be fragile, even if it's stated somewhere in the spec that writers are required to roundtrip defaults back to the server (the spec isn't explicit wrt defaults for writers, or whether schema should be shipped with data in that case). Bill
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-07, 22:30
On 06/07/2010 03:11 PM, Bill de hOra wrote:
> This means writers can't leverage schema defaults, so writers should do > something like this? > > Message message = new Message(); > // no defaults set > String quux = message > .getSchema() > .getField("foo") > .defaultValue() > .getTextValue(); > message.foo=new Utf8(quux); > > [ignoring that the writer needs to know the schema type]. I suspect > people will just write in garbage (like empty strings). No, we don't expect folks to do that. If a writer never sets a field then they might be better off dropping that field from their schema. If the writer only rarely sets it, then a schema which is a union with null might be better, making the field optional. But if the field is usually set but it's awkward for the programmer to know whether its set, then automatically filling in a default might be a useful feature and the default from the schema is probably a good value to use. Like Philip, I too am +1 for enhancing the SpecificCompiler to set default values from the schema in generated code. The only downside I see is perhaps a slight performance loss: if the default value is always overwritten then the allocation and setting of it will still be executed for each instance. > Why is it incorrect to not provide defaults when defaults are part of > the schema author's intention? Or put another way, why is reader/writer > asymmetry a goal under a given schema? It's not incorrect nor is asymmetry a goal. (I don't think Scott meant either of those to be the case.) Java instance default values were not the primary purpose of default values in a schema, but they're a fine, complementary use of them. Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-07, 22:34
Creating default fields for objects would have performance issues if they are new instances -- new Utf8() and new YourClassHere() are not free. So 'foo = new Utf8("foo"); is not right, but assignment from a static default would be fine. But unless these objects are immutable, a client could change the default.
Ideally, the specific and generic APIs can handle this better. A getter can return a default value if its field is null, or generated classes can be more sophisticated, removing default constructors and providing constructors or static factory methods that require a user to provide all 'default-less' fields up front. In the long run I'd like to have these sort of more powerful and flexible generated objects with various user-configured options, but at this time it is not there. Also, consider how Unions complicate things here. Right now they are not so fun to deal with unless it is only a union of NULL and one other type. Client code has to know the exact classes/types to inspect to resolve the union. Until there are enhancements to the API, if you are concerned with users putting garbage in, I suggest you write a wrapper class that handles this. Have users use that class rather than the classes generated by the specific compiler. On Jun 7, 2010, at 3:11 PM, Bill de hOra wrote: > Scott Carey wrote: >> No, it should not initialize the field to the default. >> >> Default values are for readers, not writers. The intended use case is schema evolution. > > This means writers can't leverage schema defaults, so writers should do > something like this? > > Message message = new Message(); > // no defaults set > String quux = message > .getSchema() > .getField("foo") > .defaultValue() > .getTextValue(); > message.foo=new Utf8(quux); > > [ignoring that the writer needs to know the schema type]. I suspect > people will just write in garbage (like empty strings). > > >> A writer must always correctly provide >> data for all of the fields in the schema it declared >> it is writing. > > Why is it incorrect to not provide defaults when defaults are part of > the schema author's intention? Or put another way, why is reader/writer > asymmetry a goal under a given schema? > > I see in the code that SpecificDatumWriter/GenericDatumWriter is passed > the Schema - By all means crash on fields with no defaults, but I'm not > clear on what harm is done by using default field data. The current code > seems fragile in comparison. > > Bill >
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-07, 22:40
On Jun 7, 2010, at 3:30 PM, Doug Cutting wrote: > On 06/07/2010 03:11 PM, Bill de hOra wrote: >> This means writers can't leverage schema defaults, so writers should do >> something like this? >> >> Message message = new Message(); >> // no defaults set >> String quux = message >> .getSchema() >> .getField("foo") >> .defaultValue() >> .getTextValue(); >> message.foo=new Utf8(quux); >> >> [ignoring that the writer needs to know the schema type]. I suspect >> people will just write in garbage (like empty strings). > > No, we don't expect folks to do that. If a writer never sets a field > then they might be better off dropping that field from their schema. If > the writer only rarely sets it, then a schema which is a union with null > might be better, making the field optional. But if the field is usually > set but it's awkward for the programmer to know whether its set, then > automatically filling in a default might be a useful feature and the > default from the schema is probably a good value to use. > > Like Philip, I too am +1 for enhancing the SpecificCompiler to set > default values from the schema in generated code. The only downside I > see is perhaps a slight performance loss: if the default value is always > overwritten then the allocation and setting of it will still be executed > for each instance. For my specific use case, defaults being set all the time would hurt performance quite a bit. (the schema not trivial -- ~5k in JSON) If the specific compiler generated a couple constructors -- * A default empty argument constructor -- fills fields with defaults. * A constructor with all fields passed in -- assigns fields from the constructor and does nothing with defaults. Then I can use the latter to avoid the performance issue. FWIW, I have written wrappers around all of my datum classes to deal with this, and I have wanted to either modify the SpecificCompiler or have a new generated object 'flavor' that had the safe-construction, union resolution, and getter methods that my wrappers provide. I just haven't had time and its lower priority than some other enhancements I'll need soon. -Scott > >> Why is it incorrect to not provide defaults when defaults are part of >> the schema author's intention? Or put another way, why is reader/writer >> asymmetry a goal under a given schema? > > It's not incorrect nor is asymmetry a goal. (I don't think Scott meant > either of those to be the case.) Java instance default values were not > the primary purpose of default values in a schema, but they're a fine, > complementary use of them. > > Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-07, 22:51
On 06/07/2010 03:40 PM, Scott Carey wrote:
> If the specific compiler generated a couple constructors -- > * A default empty argument constructor -- fills fields with defaults. > * A constructor with all fields passed in -- assigns fields from the constructor and does nothing with defaults. > > Then I can use the latter to avoid the performance issue. A problem with the latter is that such constructors can be fragile. If you have a number of integer fields and reorder them in your schema then the geneated constructor arguments will be re-ordered but the compiler will not notice a problem. Another option might be to have a flag on the compiler to control generation of default value settings. > FWIW, I have written wrappers around all of my datum classes to deal > with this, FWIW, Philip claims that (from his Google experience with protobufs) that the best practice is always to use a manually-written wrapper around generated classes. > and I have wanted to either modify the SpecificCompiler or > have a new generated object 'flavor' that had the safe-construction, > union resolution, and getter methods that my wrappers provide. I > just haven't had time and its lower priority than some other > enhancements I'll need soon. Perhaps you can just post some pseudo code that illustrates what would be generated, for discussion? For generated code like this I see no point to getter/setter methods. They seem equivalent to public fields. But perhaps I'm missing something. Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Philip Zeyliger 2010-06-07, 22:53
On Mon, Jun 7, 2010 at 3:40 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
> > On Jun 7, 2010, at 3:30 PM, Doug Cutting wrote: > >> On 06/07/2010 03:11 PM, Bill de hOra wrote: >>> This means writers can't leverage schema defaults, so writers should do >>> something like this? >>> >>> Message message = new Message(); >>> // no defaults set >>> String quux = message >>> .getSchema() >>> .getField("foo") >>> .defaultValue() >>> .getTextValue(); >>> message.foo=new Utf8(quux); >>> >>> [ignoring that the writer needs to know the schema type]. I suspect >>> people will just write in garbage (like empty strings). >> >> No, we don't expect folks to do that. If a writer never sets a field >> then they might be better off dropping that field from their schema. If >> the writer only rarely sets it, then a schema which is a union with null >> might be better, making the field optional. But if the field is usually >> set but it's awkward for the programmer to know whether its set, then >> automatically filling in a default might be a useful feature and the >> default from the schema is probably a good value to use. >> >> Like Philip, I too am +1 for enhancing the SpecificCompiler to set >> default values from the schema in generated code. The only downside I >> see is perhaps a slight performance loss: if the default value is always >> overwritten then the allocation and setting of it will still be executed >> for each instance. > > For my specific use case, defaults being set all the time would hurt performance quite a bit. (the schema not trivial -- ~5k in JSON) > > If the specific compiler generated a couple constructors -- > * A default empty argument constructor -- fills fields with defaults. > * A constructor with all fields passed in -- assigns fields from the constructor and does nothing with defaults. Unfortunately, the latter is bad for compatibility. If there are two fields of the same time, it's pretty easy (since there's not necessarily a canonical order) for this to introduce a nasty incompatibility. This is not theoretical: Thrift ran smack into this. PBs have "Builder" objects, which I imagine don't satisfy your performance worries. If you have getters and setters, you can implement setFoo() with "this.fieldsSet.setBit(FOO); this.foo = foo);" (i.e., use a bitmap to remember what has and hasn't been set). The current API doesn't use setters, I think, though, so this wouldn't be easily backwards compatible.
-
Re: schema defaults not reflected in generated objects (1.3.2)Jeff Hammerbacher 2010-06-07, 23:03
I'm on a remarkably slow internet connection, so updating the Confluence FAQ
is not happening for me. Could someone please add "Why doesn't the Java specific compiler generate a constructor with all fields for records?" to the FAQ with Philip's answer below? I asked this question two weeks ago, and I'm guessing we'll only get it more regularly. > If the specific compiler generated a couple constructors -- > > * A default empty argument constructor -- fills fields with defaults. > > * A constructor with all fields passed in -- assigns fields from the > constructor and does nothing with defaults. > > Unfortunately, the latter is bad for compatibility. If there are two > fields of the same time, it's pretty easy (since there's not > necessarily a canonical order) for this to introduce a nasty > incompatibility. This is not theoretical: Thrift ran smack into this. > PBs have "Builder" objects, which I imagine don't satisfy your > performance worries. > > If you have getters and setters, you can implement setFoo() with > "this.fieldsSet.setBit(FOO); this.foo = foo);" (i.e., use a bitmap to > remember what has and hasn't been set). The current API doesn't use > setters, I think, though, so this wouldn't be easily backwards > compatible. >
-
Re: schema defaults not reflected in generated objects (1.3.2)Bill de hOra 2010-06-08, 00:04
> Creating default fields for objects would have performance
> issues if they are new instances -- new Utf8() and > new YourClassHere() are not free. Indeed for the read scenario (although I guess clients will need to check all marshaled fields for null, just in case). Alternative for writes, if you're concerned about performance issues, would be to not require the writer to send them, or have the DatumWriters populate the data lazily. I gather the former would be a subtle design change, and the latter would be a patch on the writers to leverage the schema on the write* methods. Bill Scott Carey wrote: > Creating default fields for objects would have performance issues if they are new instances -- new Utf8() and new YourClassHere() are not free. So 'foo = new Utf8("foo"); is not right, but assignment from a static default would be fine. But unless these objects are immutable, a client could change the default. > > Ideally, the specific and generic APIs can handle this better. A getter can return a default value if its field is null, or generated classes can be more sophisticated, removing default constructors and providing constructors or static factory methods that require a user to provide all 'default-less' fields up front. In the long run I'd like to have these sort of more powerful and flexible generated objects with various user-configured options, but at this time it is not there. > Also, consider how Unions complicate things here. Right now they are not so fun to deal with unless it is only a union of NULL and one other type. Client code has to know the exact classes/types to inspect to resolve the union. > > Until there are enhancements to the API, if you are concerned with users putting garbage in, I suggest you write a wrapper class that handles this. Have users use that class rather than the classes generated by the specific compiler. > > > On Jun 7, 2010, at 3:11 PM, Bill de hOra wrote: > >> Scott Carey wrote: >>> No, it should not initialize the field to the default. >>> >>> Default values are for readers, not writers. The intended use case is schema evolution. >> This means writers can't leverage schema defaults, so writers should do >> something like this? >> >> Message message = new Message(); >> // no defaults set >> String quux = message >> .getSchema() >> .getField("foo") >> .defaultValue() >> .getTextValue(); >> message.foo=new Utf8(quux); >> >> [ignoring that the writer needs to know the schema type]. I suspect >> people will just write in garbage (like empty strings). >> >> >>> A writer must always correctly provide >>> data for all of the fields in the schema it declared >>> it is writing. >> Why is it incorrect to not provide defaults when defaults are part of >> the schema author's intention? Or put another way, why is reader/writer >> asymmetry a goal under a given schema? >> >> I see in the code that SpecificDatumWriter/GenericDatumWriter is passed >> the Schema - By all means crash on fields with no defaults, but I'm not >> clear on what harm is done by using default field data. The current code >> seems fragile in comparison. >> >> Bill >> >
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-10, 16:27
On Jun 7, 2010, at 3:51 PM, Doug Cutting wrote: > On 06/07/2010 03:40 PM, Scott Carey wrote: >> If the specific compiler generated a couple constructors -- >> * A default empty argument constructor -- fills fields with defaults. >> * A constructor with all fields passed in -- assigns fields from the constructor and does nothing with defaults. >> >> Then I can use the latter to avoid the performance issue. > > A problem with the latter is that such constructors can be fragile. If > you have a number of integer fields and reorder them in your schema then > the geneated constructor arguments will be re-ordered but the compiler > will not notice a problem. But, if you have a wrapper class, it will have the same fragility -- users will make that same mistake. Why not solve it here? My requirement is that objects are fail-fast on construction (or near to it) if the resulting state would result in a failure to serialize. Its not acceptable that a client only finds out that the object can't be serialized when it is later written. That occurs asynchronously in my case, and its almost impossible to track down 'who forgot to set field X' otherwise. So, my wrappers only have public static factories that require all fields, though many of these are our business objects and not each individual field - this avoids most of the field re-order problem above. The factory throws an exception if the data provided does not create an object that is serializable. > > Another option might be to have a flag on the compiler to control > generation of default value settings. > >> FWIW, I have written wrappers around all of my datum classes to deal >> with this, > > FWIW, Philip claims that (from his Google experience with protobufs) > that the best practice is always to use a manually-written wrapper > around generated classes. I agree it is a best practice to have wrapper _methods_ (factories, helpers, etc). A wrapper API can translate business objects to Avro and otherwise abstract away the avro details from other users. But I'm not so sure about wrapper instances with state. Its a much larger maintenance burden -- suddenly you need wrappers for all the other Avro APIs too -- file read/write, etc to handle your wrapper type rather than the avro record type. For every Avro SpecificRecord, I have an object instance that wraps it and hides the construction details, union resolution details, and setter/constructor/factory validation. These hide the fields and expose them with getters. > >> and I have wanted to either modify the SpecificCompiler or >> have a new generated object 'flavor' that had the safe-construction, >> union resolution, and getter methods that my wrappers provide. I >> just haven't had time and its lower priority than some other >> enhancements I'll need soon. > > Perhaps you can just post some pseudo code that illustrates what would > be generated, for discussion? I'll post some examples later, Its been a hectic week fighting other fires and they're still not all extinguished. > > For generated code like this I see no point to getter/setter methods. > They seem equivalent to public fields. But perhaps I'm missing something. * Setters can check the data. For example they can ensure that the type on a Union is valid. Otherwise you only find errors when you attempt to serialize it. * If you have a public field, you can't force users through the setter, so if you have a setter, you need a getter. * If the implementation details change, you break your API. For example, if we go from an int to an Integer on the internal member, or a Utf8 to a String, a geter/setter can hide that. The API is more brittle if the fields are not encapsulated. * Reflection is actually faster on method access than field access. I propose: * We use getters/setters, and allow setters and constructors to be package protected with an option so the objects can be safely passed around outside the package without stateful wrapper objects. Users can write factory methods or other static helper methods to abstract away as much as they would like from there. * Setters should not allow invalid data to be set. * Provide some built-in mechanism to resolve unions on a read and verify them on a write. Perhaps a inner static enum for each union, and a getter for each branch. * Unions with one type and NULL should not translate to Object, but rather be that type and allow null.
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-10, 18:49
On 06/10/2010 09:27 AM, Scott Carey wrote:
> I propose: > * We use getters/setters, and allow setters and constructors to be > package protected with an option so the objects can be safely passed > around outside the package without stateful wrapper objects. Users > can write factory methods or other static helper methods to abstract > away as much as they would like from there. > * Setters should not allow invalid data to be set. I'd be okay with these changes. +1 > * Provide some built-in mechanism to resolve unions on a read and > verify them on a write. Perhaps a inner static enum for each union, > and a getter for each branch. So a schema like: {"type":"record", "name":"Foo", "fields":[ {"name":"bar", "type":["int", "float"]}]} might generate something like: public class Foo { public enum BarType { INT, FLOAT } private BarType bar$Type; private Object bar; public void setBar(int value) { bar = value; bar$Type = INT; } public void setBar(float value) { bar = value; bar$Type = FLOAT; } public int getBar$Int() { if (bar$Type != INT) throw new ...; return (int)bar; } public float getBar$Float() { if (bar$Type != FLOAT) throw new ...; return (float)bar; } } Is that the sort of thing you have in mind? I've added dollar signs to avoid potential conflicts, e.g., if the user had a field named 'barType' or 'barInt'. I think its best to always add the dollars, since otherwise it could get awkward if a field named 'barInt' were added later. > * Unions with one type and NULL should not translate to Object, but > rather be that type and allow null. That's already the case, no? Also, you didn't provide a proposal for constructors, yet, right? Perhaps all of the setters could be on a generated builder class, with an instance only returned when it's completely populated? Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-10, 21:09
On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote: > On 06/10/2010 09:27 AM, Scott Carey wrote: >> I propose: >> * We use getters/setters, and allow setters and constructors to be >> package protected with an option so the objects can be safely passed >> around outside the package without stateful wrapper objects. Users >> can write factory methods or other static helper methods to abstract >> away as much as they would like from there. >> * Setters should not allow invalid data to be set. > > I'd be okay with these changes. +1 > >> * Provide some built-in mechanism to resolve unions on a read and >> verify them on a write. Perhaps a inner static enum for each union, >> and a getter for each branch. > > So a schema like: > > {"type":"record", "name":"Foo", "fields":[ > {"name":"bar", "type":["int", "float"]}]} > > might generate something like: > > public class Foo { > public enum BarType { INT, FLOAT } > private BarType bar$Type; > private Object bar; > public void setBar(int value) { bar = value; bar$Type = INT; } > public void setBar(float value) { bar = value; bar$Type = FLOAT; } > public int getBar$Int() { > if (bar$Type != INT) throw new ...; > return (int)bar; > } > public float getBar$Float() { > if (bar$Type != FLOAT) throw new ...; > return (float)bar; > } > } > I was trying to think of a way to storing the type in there to keep the memory footprint minimized. The getter name disambiguation needs a good format, and the $ delimiter above would work well and be clear to users. However, I was thinking something more along these lines: {code} public class Foo { // intrinsic simple types could be shared rather than in each class // named types must be per union. public enum BarType { INT(Integer.class), FLOAT(Float.class); private Class<?> type; // internal use only BarType(Class<?> type) { this.type = type; } } private Object bar; // Integer vs int: avoid auto-unbox, auto-rebox if the user has an Integer already public void setBar(Integer value) { bar = value; } // same, Float vs float public void setBar(Float value) { bar = value; } public Integer getBar$Int() { // we could return null instead of throwing. Return Integer, client can un-box if needed if (bar.getClass() == INT.type) return bar; else throw new ...; } public Float getBar$Float() { if (bar.getClass() == FLOAT.type) return bar; else throw new ...; } public BarType getBar$Type() { // we could use a static ConcurrentHashMap<Class<?>, BarType> instead of cascaded if/else if/ ... Class<?> c = bar.getClass(); if (INT.type == c) { return INT; } else if (FLOAT.type == c) { return FLOAT; } else { throw new NeverHappensException(); // :) } } {code} I think the above is thread-safe with no race conditions, though if one thread is calling the setter toggling the type another thread calling getBar$Type might see a 'delayed' type. Since the ENUM is exposed to the client, it also needs a regular naming pattern so that schema evolution doesn't change the method names. A user accessing an enum can then switch on the enum and call the appropriate method. > Is that the sort of thing you have in mind? I've added dollar signs to > avoid potential conflicts, e.g., if the user had a field named 'barType' > or 'barInt'. I think its best to always add the dollars, since > otherwise it could get awkward if a field named 'barInt' were added later. > >> * Unions with one type and NULL should not translate to Object, but >> rather be that type and allow null. > > That's already the case, no? > > Also, you didn't provide a proposal for constructors, yet, right? Yes, I'll need a more complicated example to discuss that one appropriately. A pattern that aids in business object <> Avro translation, especially in the presence of class hierarchies and a couple different object composition patterns. A Union as a representation of a subclass is tricky. So is it when modeling composition. Providing builder classes rather than constructors could be extremely useful to aid in encapsulation. It should address my constructor concern and remain flexible.
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-10, 21:13
On Jun 10, 2010, at 2:09 PM, Scott Carey wrote: > > I was trying to think of a way to storing the type in there to keep the memory footprint minimized. a way to *_avoid_* storing the type. Also, it would be good to have the type state in one place atomically -- the value itself. If its in two places then race conditions become a problem.
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-10, 21:57
Re boxing: Good idea to minimize it.
Re storing type: this is a time/space tradeoff. If you store the union type in each record then you can write instances faster, since you can use type.ordinal(), and also dispatch switch statements for the value faster in user code. A hash table lookup is probably not a lot faster than an if/then/elseif/... unless the union is really big. Personally I'd opt for performance over slightly smaller instances here. I suppose we could offer both options, since it shouldn't alter the public API. Also, we don't need to store a class in the enum. At code generation time we can simply write, e.g., Integer.class wherever you put INT.type. Doug On 06/10/2010 02:09 PM, Scott Carey wrote: > > On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote: > >> On 06/10/2010 09:27 AM, Scott Carey wrote: >>> I propose: >>> * We use getters/setters, and allow setters and constructors to be >>> package protected with an option so the objects can be safely passed >>> around outside the package without stateful wrapper objects. Users >>> can write factory methods or other static helper methods to abstract >>> away as much as they would like from there. >>> * Setters should not allow invalid data to be set. >> >> I'd be okay with these changes. +1 >> >>> * Provide some built-in mechanism to resolve unions on a read and >>> verify them on a write. Perhaps a inner static enum for each union, >>> and a getter for each branch. >> >> So a schema like: >> >> {"type":"record", "name":"Foo", "fields":[ >> {"name":"bar", "type":["int", "float"]}]} >> >> might generate something like: >> >> public class Foo { >> public enum BarType { INT, FLOAT } >> private BarType bar$Type; >> private Object bar; >> public void setBar(int value) { bar = value; bar$Type = INT; } >> public void setBar(float value) { bar = value; bar$Type = FLOAT; } >> public int getBar$Int() { >> if (bar$Type != INT) throw new ...; >> return (int)bar; >> } >> public float getBar$Float() { >> if (bar$Type != FLOAT) throw new ...; >> return (float)bar; >> } >> } >> > > I was trying to think of a way to storing the type in there to keep the memory footprint minimized. > The getter name disambiguation needs a good format, and the $ delimiter above would work well and be clear to users. > However, I was thinking something more along these lines: > > {code} > public class Foo { > // intrinsic simple types could be shared rather than in each class > // named types must be per union. > public enum BarType { > INT(Integer.class), > FLOAT(Float.class); > private Class<?> type; // internal use only > BarType(Class<?> type) { > this.type = type; > } > } > private Object bar; > // Integer vs int: avoid auto-unbox, auto-rebox if the user has an Integer already > public void setBar(Integer value) { bar = value; } > // same, Float vs float > public void setBar(Float value) { bar = value; } > public Integer getBar$Int() { > // we could return null instead of throwing. Return Integer, client can un-box if needed > if (bar.getClass() == INT.type) return bar; else throw new ...; > } > public Float getBar$Float() { > if (bar.getClass() == FLOAT.type) return bar; else throw new ...; > } > public BarType getBar$Type() { > // we could use a static ConcurrentHashMap<Class<?>, BarType> instead of cascaded if/else if/ ... > Class<?> c = bar.getClass(); > if (INT.type == c) { > return INT; > } else if (FLOAT.type == c) { > return FLOAT; > } else { > throw new NeverHappensException(); // :) > } > } > {code} > > I think the above is thread-safe with no race conditions, though if one thread is calling the setter toggling the type another thread calling getBar$Type might see a 'delayed' type. > > Since the ENUM is exposed to the client, it also needs a regular naming pattern so that schema evolution doesn't change the method names.
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-10, 22:04
On 06/10/2010 02:13 PM, Scott Carey wrote:
> Also, it would be good to have the type state in one place atomically > -- the value itself. If its in two places then race conditions > become a problem. That would be a nice feature, but I don't see it as a requirement. A setter method need not always be atomic or thread-safe. Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Scott Carey 2010-06-10, 22:44
On Jun 10, 2010, at 2:57 PM, Doug Cutting wrote: > Re boxing: Good idea to minimize it. > > > Also, we don't need to store a class in the enum. At code generation > time we can simply write, e.g., Integer.class wherever you put INT.type. Good point, this is generated code, if we want to replace Integer.class with AvroSuperInteger.class it would be hidden from the API. > Re storing type: this is a time/space tradeoff. If you store the union > type in each record then you can write instances faster, since you can > use type.ordinal(), and also dispatch switch statements for the value > faster in user code. I guess we'll have to see what the time/space tradeoff looks like with some testing later. For writing, its still a switch on the enum without the persisted type, something like BarType bt = getBar$Type(); write(bt.ordinal()); switch(bt) { case INT: writeInt(getBar$Int()) ... case FLOAT: writeFloat(getBar$Float()) ... } The getBar$Type() requires the cascaded if/else though. And the get has a conditional on getClass(). The latter should be very fast, the former less so. This could be inverted to make the datum class be responsible for the writing too: // note, thread safe writeBarUnion(Encoder e) { Object bar = this.bar; Class<?> c = bar.getClass(); BarType bt = getBar$Type(c); // private helper writeInt(bt.ordinal()); switch(bt) { case INT: writeInt((int)bar); break; case FLOAT: writeFloat((float)bar); break; } } It might even make sense to just go all the way and add a writeTo(Encoder e); Though that has some implications for lots of other stuff -- anything using a DatumWriter. It does open op opportunities for optimizations and simplification of union resolution without traversing the Schema. > A hash table lookup is probably not a lot faster > than an if/then/elseif/... unless the union is really big. Personally > I'd opt for performance over slightly smaller instances here. I suppose > we could offer both options, since it shouldn't alter the public API. > Sometimes memory size is performance. It depends on whether you're streaming these in (size isn't a big deal) or holding on to them for a long time. I'd rather solidify the API though, and keep thinking through all the possibilities. As long as all these options are available with one API then minor implementation detail options like this can be dealt with later. One other thing that would be very useful longer term would be to unify the Specific and Reflect APIs. What I mean by that, is that the Specific API should generate classes in such a way that the Reflect API would serialize the objects to the same schema if asked to reflect on the object. This probably means simply decorating the Specific objects with annotations. > > Doug > > On 06/10/2010 02:09 PM, Scott Carey wrote: >> >> On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote: >> >>> On 06/10/2010 09:27 AM, Scott Carey wrote: >>>> I propose: >>>> * We use getters/setters, and allow setters and constructors to be >>>> package protected with an option so the objects can be safely passed >>>> around outside the package without stateful wrapper objects. Users >>>> can write factory methods or other static helper methods to abstract >>>> away as much as they would like from there. >>>> * Setters should not allow invalid data to be set. >>> >>> I'd be okay with these changes. +1 >>> >>>> * Provide some built-in mechanism to resolve unions on a read and >>>> verify them on a write. Perhaps a inner static enum for each union, >>>> and a getter for each branch. >>> >>> So a schema like: >>> >>> {"type":"record", "name":"Foo", "fields":[ >>> {"name":"bar", "type":["int", "float"]}]} >>> >>> might generate something like: >>> >>> public class Foo { >>> public enum BarType { INT, FLOAT } >>> private BarType bar$Type; >>> private Object bar; >>> public void setBar(int value) { bar = value; bar$Type = INT; } >>> public void setBar(float value) { bar = value; bar$Type = FLOAT; }
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-06-10, 23:40
On 06/10/2010 03:44 PM, Scott Carey wrote:
> The getBar$Type() requires the cascaded if/else though. Right, that's the cost I was alluding too. Some unions are reasonably large, e.g., http://s.apache.org/Events.avpr, where the union at the end has 16 branches, and the cost of the cascaded if/else could be higher than we'd like. It'd be nice if unions with even hundreds of branches were still fast to read and write. Doug
-
Re: schema defaults not reflected in generated objects (1.3.2)Eric Evans 2010-09-04, 02:09
Sorry to dredge up such an old thread, but... On Mon, 2010-06-07 at 15:30 -0700, Doug Cutting wrote: > On 06/07/2010 03:11 PM, Bill de hOra wrote: > > This means writers can't leverage schema defaults, so writers should do > > something like this? > > > > Message message = new Message(); > > // no defaults set > > String quux = message > > .getSchema() > > .getField("foo") > > .defaultValue() > > .getTextValue(); > > message.foo=new Utf8(quux); > > > > [ignoring that the writer needs to know the schema type]. I suspect > > people will just write in garbage (like empty strings). > > No, we don't expect folks to do that. If a writer never sets a field > then they might be better off dropping that field from their schema. If > the writer only rarely sets it, then a schema which is a union with null > might be better, making the field optional. But if the field is usually > set but it's awkward for the programmer to know whether its set, then > automatically filling in a default might be a useful feature and the > default from the schema is probably a good value to use. FWIW, the situation that keeps reoccurring in Cassandra is one where there are sane defaults that work 90% of the time. Using factories to apply defaults will work, but then they would need to be kept in sync across all of the clients (of which there are many). Making the field optional and applying the defaults on the server side will work, but these defaults are technically part of the schema and it seems less than optimal to separate them like this (and it hides the defaults from clients). > Like Philip, I too am +1 for enhancing the SpecificCompiler to set > default values from the schema in generated code. The only downside I > see is perhaps a slight performance loss: if the default value is always > overwritten then the allocation and setting of it will still be executed > for each instance. This covers Java when generated classes are used, but the other writers would need to be changed as well. I'm willing to work on patches for this, is this still something you'd considering merging? Should it be made optional? -- Eric Evans [EMAIL PROTECTED]
-
Re: schema defaults not reflected in generated objects (1.3.2)Doug Cutting 2010-09-07, 16:38
On 09/03/2010 07:09 PM, Eric Evans wrote:
> This covers Java when generated classes are used, but the other writers > would need to be changed as well. I'm willing to work on patches for > this, is this still something you'd considering merging? Should it be > made optional? I'd love to see a feature that automatically stores default values into data structures as they're created. Philip's converted Java's code generator to be template based: https://issues.apache.org/jira/browse/AVRO-648 This should get committed soon (1.4.1?). Then we can add alternate sets of templates. We can add one that stores default values. We can also add templates to emit code for languages besides Java. Doug |