Avro, mail # user - Why is the String type a Schema property?


Why is the String type a Schema property?
Mark Hayes 2012-05-24, 04:10
Hi,

This is my first post to this list.  I'm writing a binding API for our
database product, to allow users to easily store Avro binary data in the
database and use any of the built-in Avro object representations (Generic,
Specific, etc.) as well as one we've added (JsonNode) by subclassing the
Generic classes.

In our binding API, we don't support object reuse, so the Utf8 class has
no real benefit and String would be more convenient for our users.  I see
that String can be used (rather than the default, Utf8) by two different
mechanisms: (1) I can override GenericDatumReader.readString, or (2) I can
set the "avro.java.string" property for each string field in the schema
to "String".
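
To make option (2) concrete, here is roughly what I understand the schema
would have to look like.  This is only a sketch: the record name "User"
and the field names are made up for illustration, and only the
"avro.java.string" property itself comes from Avro.

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "firstName",
     "type": {"type": "string", "avro.java.string": "String"}},
    {"name": "lastName",
     "type": {"type": "string", "avro.java.string": "String"}}
  ]
}
```

Note that the same property has to be repeated on every string type in the
schema.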

I would like to do (1) because it is cleaner (the schema isn't cluttered
with metadata that is the same for every string field) and because I don't
think information about the object representation logically belongs in the
schema (for two users of the same schema, one may use an object
representation with String and the other a representation with Utf8).

So my question is:  Why is the string type a property in the schema, i.e.,
why does option (2) exist in Avro?  Is there something I'm missing about
its benefit?

Also, if I use option (1), is this likely to cause compatibility problems
with other components that process Avro data and Avro schemas, such as
Hadoop?  Our users may create a schema and store the data for that schema
in our database, and then later use the same schema for processing this
data in Hadoop.  Hadoop is just one example; one of the reasons we chose
Avro is its widespread use in many components.  Does there
typically need to be agreement about the string type among different
entities that process data for a shared schema?

Thanks in advance for any advice.
--mark