Avro >> mail # user >> Why is the String type a Schema property?


Re: Why is the String type a Schema property?
On Thu, May 24, 2012 at 9:08 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> On 05/23/2012 09:10 PM, Mark Hayes wrote:
>
>> So my question is:  Why is the string type a property in the schema,
>> i.e., why does option (2) exist in Avro?  Is there something I'm missing
>> about its benefit?
>>
>
> It's for back compatibility.  Strings in specific and generic
> representations were originally always read as Utf8, so many existing
> applications expect strings to be Utf8.  Rather than breaking all of these
> applications we instead permitted folks to opt in to this change.  For
> applications that use the specific representation (those that generate
> code) and wish to change from Utf8 to String it requires only adding a
> single parameter to their Maven configuration, so it's not very invasive.
>  The runtime must know which representation is desired for strings, and the
> Schema is the convenient runtime structure to annotate.
>
[snip]

Thank you for the reply, Doug!  Your reply has made me think harder about
why this is an issue for us.

I think the reason is that we're storing the schema in the database, with
an internal reference to the schema in each record.  The stored schema is
shared by all clients reading and writing records using that schema, even
though clients operating on the same records may have very distinct
purposes: an OLTP application, a Map/Reduce job for data analysis, or a
general purpose utility for data viewing.

The stored/shared schema must either have these string type properties, or
not.  If it does have them, this impacts the string type for all clients
reading from the database.  So they would have to all agree on the string
type, or dynamically determine it.

So it is due to this sharing of the schema that I'm tending toward
subclassing the DatumReader.  That way, the string type is divorced from
the shared schema, and each client can decide independently on the string
type it wishes to use.
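
A minimal sketch of that idea, using only the Java standard library. The
class names SketchDatumReader and JavaStringDatumReader and the readString
hook shown here are hypothetical stand-ins, not Avro's actual classes or
signatures; the point is only that the string representation is chosen by
the reader subclass each client instantiates, while the shared schema
carries no string-type property.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a datum reader; this is NOT Avro's class.
class SketchDatumReader {
    // Overridable hook: turn the UTF-8 bytes of a string field into
    // whatever CharSequence representation this client prefers.
    // The default returns a non-String CharSequence (a CharBuffer),
    // loosely mimicking a reader that does not hand back java.lang.String.
    protected CharSequence readString(byte[] utf8) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
    }

    // Reads one string field. No schema property is consulted to pick
    // the string type; the decision lives entirely in readString().
    public final CharSequence read(byte[] encodedField) {
        return readString(encodedField);
    }
}

// A client that wants java.lang.String overrides only the hook.
class JavaStringDatumReader extends SketchDatumReader {
    @Override
    protected CharSequence readString(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

With this shape, an OLTP client could construct a JavaStringDatumReader
while a Map/Reduce job keeps the default reader, and both read the same
records against the same stored schema.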

Does this make sense to you as well?

--mark