Re: Why is the String type a Schema property?
Doug Cutting 2012-05-24, 16:08
On 05/23/2012 09:10 PM, Mark Hayes wrote:
> So my question is: Why is the string type a property in the schema,
> i.e., why does option (2) exist in Avro? Is there something I'm missing
> about its benefit?
It's for backward compatibility. Strings in the specific and generic
representations were originally always read as Utf8, so many existing
applications expect strings to be Utf8. Rather than break all of
these applications, we instead permitted folks to opt in to this change.
For applications that use the specific representation (those that
generate code) and wish to change from Utf8 to String, this requires only
adding a single parameter to their Maven configuration, so it's not very
invasive. The runtime must know which representation is desired for
strings, and the Schema is the convenient runtime structure to annotate.
Note that we'd prefer not to make it a property of the
Encoder/Decoder or DatumWriter/DatumReader instead, since we permit folks
to intermix reflect, specific and generic objects in a tree. For example,
one may have a reflected datum in which some fields are defined by
generated specific classes and other fields correspond to no class
on the classpath, so the generic representation is used for them. This
flexibility permits classes like org.apache.avro.mapred.Pair<X,Y>, which
can contain reflect, specific or generic instances.
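To make the schema-annotation idea above concrete, here is a toy sketch in plain Java. It does not use Avro's classes: ToySchema and readString are hypothetical stand-ins invented for illustration, and a StringBuilder stands in for Avro's Utf8. Only the property name "avro.java.string" is taken from Avro itself. The point is just that the reader can consult the schema it is already walking to decide which in-memory string type to produce.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only: these are NOT Avro's classes.
class StringPropSketch {
    // Property name Avro uses on string schemas; everything else here is hypothetical.
    static final String STRING_PROP = "avro.java.string";

    // Hypothetical stand-in for a string schema and its properties.
    static final class ToySchema {
        final Map<String, String> props = new HashMap<>();
    }

    // Hypothetical reader step: decode the bytes, then pick the
    // in-memory representation from the schema's annotation.
    static CharSequence readString(ToySchema schema, byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        if ("String".equals(schema.props.get(STRING_PROP))) {
            return decoded;                // opted in: java.lang.String
        }
        return new StringBuilder(decoded); // default: stand-in for Utf8
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        ToySchema plain = new ToySchema();
        ToySchema annotated = new ToySchema();
        annotated.props.put(STRING_PROP, "String");
        System.out.println(readString(plain, bytes) instanceof String);     // false
        System.out.println(readString(annotated, bytes) instanceof String); // true
    }
}
```

Because the choice travels with each schema node rather than with the reader, two string fields in the same tree can use different representations, which is what mixing reflect, specific and generic data requires.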
> Also, if I use option (1), is this likely to cause compatibility
> problems with other components that process Avro data and Avro schemas,
> such as Hadoop?
No, I don't think so. If you use your own DatumReader implementation to
read your data then that should not affect anyone else. Reflect,
specific and generic inherit from one another, sharing many parts of
their implementation, so changes to these must keep the others in mind,
but if you've defined a new DatumReader that's only used to read your
data that should not affect any other applications.
> Our users may create a schema and store the data for
> that schema in our database, and then later use the same schema for
> processing this data in Hadoop. Hadoop is just one example, since one
> of the reasons we chose Avro is because of its widespread use in many
> components. Does there typically need to be agreement about the string
> type among different entities that process data for a shared schema?
Not really. If you're reading things that correspond to a generated
specific class then it will always use the representation it expects,
since generated code contains its schema. If you use reflection to read
things into instances of a non-generated class then it will generally
read strings as java.lang.String. The generic representation will use
Utf8 for unannotated string schemas. Your map and reduce functions will
need to be written accordingly.
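One way to write such functions so they don't care which representation arrives is to accept CharSequence, which both java.lang.String and Avro's Utf8 implement, and convert with toString() before comparing or using a value as a map key. The sketch below is plain Java with no Avro dependency; the class and method names are made up for illustration, and a StringBuilder stands in for a Utf8 value.

```java
import java.util.HashMap;
import java.util.Map;

class CharSequenceHandling {
    // Compare by content, so String and Utf8-like values match.
    static boolean sameText(CharSequence a, CharSequence b) {
        return a.toString().equals(b.toString());
    }

    // Normalize to java.lang.String before using the value as a map key.
    static void count(Map<String, Integer> counts, CharSequence word) {
        counts.merge(word.toString(), 1, Integer::sum);
    }

    public static void main(String[] args) {
        CharSequence fromGeneric = new StringBuilder("avro"); // stand-in for Utf8
        CharSequence fromReflect = "avro";                    // java.lang.String

        // String.equals is false for any non-String argument.
        System.out.println(fromReflect.equals(fromGeneric));    // false
        System.out.println(sameText(fromReflect, fromGeneric)); // true

        Map<String, Integer> counts = new HashMap<>();
        count(counts, fromGeneric);
        count(counts, fromReflect);
        System.out.println(counts.get("avro")); // 2
    }
}
```

The conversion costs an allocation per value, so code on a hot path may prefer to commit to one representation instead.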
I hope this helps!