-
Why is the String type a Schema property?
Mark Hayes 2012-05-24, 04:10
Hi,
This is my first post to this list. I'm writing a binding API for our database product, to allow users to easily store Avro binary data in the database and use any of the built-in Avro object representations (Generic, Specific, etc) as well as one we've added (JsonNode) by subclassing the Generic classes.
In our binding API, we don't support object reuse. So the Utf8 class has no real benefit and String would be more convenient for our users. I see that type String can be used (rather than the Utf8 default) by two different mechanisms: (1) I can override GenericDatumReader.readString, or (2) I can set the "avro.java.string" property for each string field in the schema to "String".
I would like to do (1) because it is cleaner (the schema isn't cluttered with metadata that is the same for every string field) and because I don't think information about the object representation logically belongs in the schema (for two users of the same schema, one may use an object representation with String and the other a representation with Utf8).
So my question is: Why is the string type a property in the schema, i.e., why does option (2) exist in Avro? Is there something I'm missing about its benefit?
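For reference, option (2) above amounts to annotating each string schema with the "avro.java.string" property. A minimal sketch of such a schema follows; the record name `User` and field name `name` are hypothetical and chosen only for illustration:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": {"type": "string", "avro.java.string": "String"}
    }
  ]
}
```

Note that the property sits on the string type itself, so it has to be repeated for every string field, which is the clutter described above.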
Also, if I use option (1), is this likely to cause compatibility problems with other components that process Avro data and Avro schemas, such as Hadoop? Our users may create a schema and store the data for that schema in our database, and then later use the same schema for processing this data in Hadoop. Hadoop is just one example, since one of the reasons we chose Avro is because of its widespread use in many components. Does there typically need to be agreement about the string type among different entities that process data for a shared schema?
Thanks in advance for any advice. --mark
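Option (1) from the message above might look roughly like the following sketch. The class names `StringDatumReader` and `StringDatumReaderDemo` are hypothetical, and the sketch assumes an Avro release in which GenericDatumReader still routes unannotated strings through the protected `readString(Object, Decoder)` method; behaviour may differ between Avro versions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

/** Hypothetical reader that returns java.lang.String instead of Utf8. */
class StringDatumReader<T> extends GenericDatumReader<T> {
    StringDatumReader(Schema schema) {
        super(schema);
    }

    @Override
    protected Object readString(Object old, Decoder in) throws IOException {
        // Ignore the reusable 'old' value: this binding does not reuse objects.
        return in.readString();
    }
}

/** Hypothetical demo: write one string, then read it back with the subclass. */
class StringDatumReaderDemo {
    static Object roundTrip(String value) throws IOException {
        // Plain, unannotated string schema: no "avro.java.string" property.
        Schema schema = Schema.create(Schema.Type.STRING);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<Object>(schema).write(value, encoder);
        encoder.flush();

        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new StringDatumReader<Object>(schema).read(null, decoder);
    }
}
```

With this approach the stored schema stays free of representation metadata; the choice of String is local to whoever constructs the reader.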
-
Re: Why is the String type a Schema property?
Doug Cutting 2012-05-24, 16:08
On 05/23/2012 09:10 PM, Mark Hayes wrote:
> So my question is: Why is the string type a property in the schema,
> i.e., why does option (2) exist in Avro? Is there something I'm missing
> about its benefit?
It's for back compatibility. Strings in specific and generic representations were originally always read as Utf8, so many existing applications expect strings to be Utf8. Rather than breaking all of these applications we instead permitted folks to opt in to this change. For applications that use the specific representation (those that generate code) and wish to change from Utf8 to String it requires only adding a single parameter to their Maven configuration, so it's not very invasive. The runtime must know which representation is desired for strings, and the Schema is the convenient runtime structure to annotate.
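The Maven parameter referred to here is, as far as I know, the `stringType` setting of the avro-maven-plugin. A sketch of the relevant fragment of a `pom.xml`, with the plugin version and executions left out:

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <configuration>
    <!-- Generate java.lang.String fields instead of the default -->
    <stringType>String</stringType>
  </configuration>
</plugin>
```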
Note that we'd prefer not to instead make it a property of the Encoder/Decoder or DatumWriter/DatumReader since we permit folks to intermix reflect, specific and generic objects in a tree. For example, one may have a reflected datum that has some fields which are defined by generated specific classes and other fields which correspond to no class on the classpath so the generic representation is used. This flexibility permits classes like org.apache.avro.mapred.Pair<X,Y>, which can contain reflect, specific or generic instances.
> Also, if I use option (1), is this likely to cause compatibility
> problems with other components that process Avro data and Avro schemas,
> such as Hadoop?
No, I don't think so. If you use your own DatumReader implementation to read your data then that should not affect anyone else. Reflect, specific and generic inherit from one another, sharing many parts of their implementation, so changes to these must keep the others in mind, but if you've defined a new DatumReader that's only used to read your data that should not affect any other applications.
> Our users may create a schema and store the data for
> that schema in our database, and then later use the same schema for
> processing this data in Hadoop. Hadoop is just one example, since one
> of the reasons we chose Avro is because of its widespread use in many
> components. Does there typically need to be agreement about the string
> type among different entities that process data for a shared schema?
Not really. If you're reading things that correspond to a generated specific class then it will always use the representation it expects, since generated code contains its schema. If you use reflection to read things into instances of a non-generated class then it will generally read strings as java.lang.String. The generic representation will use Utf8 for unannotated string schemas. Your map and reduce functions will need to be written accordingly.
I hope this helps!
Doug
-
Re: Why is the String type a Schema property?
Mark Hayes 2012-05-24, 17:28
On Thu, May 24, 2012 at 9:08 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> On 05/23/2012 09:10 PM, Mark Hayes wrote:
>
>> So my question is: Why is the string type a property in the schema,
>> i.e., why does option (2) exist in Avro? Is there something I'm missing
>> about its benefit?
>
> It's for back compatibility. Strings in specific and generic
> representations were originally always read as Utf8, so many existing
> applications expect strings to be Utf8. Rather than breaking all of these
> applications we instead permitted folks to opt in to this change. For
> applications that use the specific representation (those that generate
> code) and wish to change from Utf8 to String it requires only adding a
> single parameter to their Maven configuration, so it's not very invasive.
> The runtime must know which representation is desired for strings, and the
> Schema is the convenient runtime structure to annotate.
>
[snip]
Thank you for the reply, Doug! Your reply has made me think harder about why this is an issue for us.
I think the reason is that we're storing the schema in the database, with an internal reference to the schema in each record. The stored schema is shared by all clients reading and writing records using that schema, even though clients operating on the same records may have very distinct purposes: an OLTP application, a Map/Reduce job for data analysis, or a general purpose utility for data viewing.
The stored/shared schema must either have these string type properties, or not. If it does have them, this impacts the string type for all clients reading from the database. So they would have to all agree on the string type, or dynamically determine it.
So it is due to this sharing of the schema that I'm tending toward subclassing the DatumReader. That way, the string type is divorced from the shared schema, and each client can decide independently on the string type it wishes to use.
Does this make sense to you as well?
--mark
-
Re: Why is the String type a Schema property?
Doug Cutting 2012-05-24, 18:51
On 05/24/2012 10:28 AM, Mark Hayes wrote:
> The stored/shared schema must either have these string type properties,
> or not. If it does have them, this impacts the string type for all
> clients reading from the database. So they would have to all agree on
> the string type, or dynamically determine it.
No, there are two schemas involved in reading, the writer's and the reader's. The reader's schema can determine what string representation is used. This is the case with reflect and specific, which resolve the schema used when writing against the schema of the class that's being used to represent things when reading. So you don't need to worry about reflect or specific, since they supply their own schema that has the string representation they expect.
So you only need to worry about different string representations if you're using the generic representation and do not specify a distinct reader's schema that you expect to see things as, or if you use some other kind of datum reader (e.g., one you've written yourself) that subclasses GenericDatumReader, doesn't override readString(), and you don't pass an expected, reader's schema.
Doug
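The writer's/reader's schema distinction described above can be sketched with the generic representation as follows. The class name `ReaderSchemaDemo` and its method names are hypothetical, and the sketch assumes an Avro release whose GenericDatumReader honours the "avro.java.string" property on the reader's schema.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

/** Hypothetical demo of choosing the string class via the reader's schema. */
class ReaderSchemaDemo {
    // Writer's schema: the shared, stored schema with no string annotation.
    static final Schema WRITER =
        new Schema.Parser().parse("{\"type\": \"string\"}");

    // Reader's schema: the same type, annotated by this client only.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\": \"string\", \"avro.java.string\": \"String\"}");

    static byte[] encode(String value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<Object>(WRITER).write(value, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    /** Reads with a distinct reader's schema that asks for String. */
    static Object decodeWithReaderSchema(byte[] bytes) throws IOException {
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<Object>(WRITER, READER).read(null, decoder);
    }

    /** Reads with only the writer's schema, so the generic default applies. */
    static Object decodeWithWriterSchemaOnly(byte[] bytes) throws IOException {
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<Object>(WRITER).read(null, decoder);
    }
}
```

The stored bytes and the stored schema are identical in both calls; only the schema the client passes as its expected schema differs.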
-
Re: Why is the String type a Schema property?
Mark Hayes 2012-05-24, 20:33
On Thu, May 24, 2012 at 11:51 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:
> No, there are two schemas involved in reading, the writer's and the
> reader's. The reader's schema can determine what string representation is
> used. This is the case with reflect and specific, which resolve the schema
> used when writing against the schema of the class that's being used to
> represent things when reading. So you don't need to worry about reflect or
> specific, since they supply their own schema that has the string
> representation they expect.
>
> So you only need to worry about different string representations if you're
> using the generic representation and do not specify a distinct reader's
> schema that you expect to see things as, or if you use some other kind of
> datum reader (e.g., one you've written yourself) that subclasses
> GenericDatumReader, doesn't override readString(), and you don't pass an
> expected, reader's schema.

Yes, you're right, it's only use of GenericRecord that is impacted. Sorry to confuse the issue.
Thanks for your help! --mark
-
Re: Why is the String type a Schema property?
Ey-Chih Chow 2012-05-27, 01:58
Under the Avro map/reduce framework, if we use the generic representation, how can we specify the reader's schema? In addition, what advantages do we get from using the generic representation? Thanks.
Ey-Chih Chow
On May 24, 2012, at 11:51 AM, Doug Cutting wrote:
> On 05/24/2012 10:28 AM, Mark Hayes wrote:
>> The stored/shared schema must either have these string type properties,
>> or not. If it does have them, this impacts the string type for all
>> clients reading from the database. So they would have to all agree on
>> the string type, or dynamically determine it.
>
> No, there are two schemas involved in reading, the writer's and the reader's. The reader's schema can determine what string representation is used. This is the case with reflect and specific, which resolve the schema used when writing against the schema of the class that's being used to represent things when reading. So you don't need to worry about reflect or specific, since they supply their own schema that has the string representation they expect.
>
> So you only need to worry about different string representations if you're using the generic representation and do not specify a distinct reader's schema that you expect to see things as, or if you use some other kind of datum reader (e.g., one you've written yourself) that subclasses GenericDatumReader, doesn't override readString(), and you don't pass an expected, reader's schema.
>
> Doug