Avro, mail # user - question about completely untagged data...


Re: question about completely untagged data...
Philip Zeyliger 2010-11-29, 04:09
Hi David,

Your assessment of Thrift and Avro being isomorphic is correct, and
you've correctly identified the major philosophical difference.  (It's
in fact a little bit deeper than you suggest: at read time, there are
always two schemas available: the reader's schema and the original
schema that the data was written with.)
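
To make the two-schema idea concrete, here is a minimal Python sketch of
resolving a record from the writer's schema to the reader's schema.  It is
not the Avro library's resolver: the function name resolve and the User
schemas are hypothetical, and it covers only added and removed record
fields (no type promotion, aliases, or unions).

```python
def resolve(writer_schema, reader_schema, written):
    """Project a record decoded with the writer's schema onto the reader's schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in both schemas: take the written value.
            result[name] = written[name]
        elif "default" in field:
            # Field is new in the reader's schema: fall back to its default.
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    # Fields present only in the writer's schema are silently dropped.
    return result


# Hypothetical schemas: v1 wrote the data, v2 is what the reader expects.
writer_v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "nickname", "type": "string"},
]}
reader_v2 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string", "default": ""},
]}

resolved = resolve(writer_v1, reader_v2, {"id": 7, "nickname": "dj"})
print(resolved)  # {'id': 7, 'email': ''}
```

The point is that the reader never has to guess: it knows exactly which
fields were written, and uses its own schema only to decide what to keep
and what to default.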

Where are you storing the Avro records?  Avro's binary format for
records is unlikely to change: it's pretty stable and changing would
be a big deal.  On the other hand, Avro already has multiple ways of
passing schema information along.  Avro's RPC implementations exchange
schemas in a handshake.  Avro Data Files store the schema in the header.  You could, in
your system, always store (schema, data) tuples.  That's what Sam is
doing in HAvroBase
(http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).
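
As a rough illustration of storing (schema, data) tuples, the sketch
below hand-rolls a small subset of Avro's binary encoding (long and string
fields only) and keeps the JSON schema next to every payload.  It does not
use the Avro library, and the pack/unpack layout (length-prefixed schema
followed by the record bytes) is an assumption made for this example; it is
neither the Avro Data File format nor what HAvroBase does.

```python
import json


def encode_long(n):
    """Zigzag plus variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Return (value, next_position) for a long starting at buf[pos]."""
    z, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if b < 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(schema, record):
    """Concatenate field values in schema order; no names or type tags are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "long":
            out += encode_long(value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(schema, buf, pos=0):
    """Decode one record; the schema is the only guide to what the bytes mean."""
    record = {}
    for field in schema["fields"]:
        if field["type"] == "long":
            record[field["name"]], pos = decode_long(buf, pos)
        elif field["type"] == "string":
            n, pos = decode_long(buf, pos)
            record[field["name"]] = buf[pos:pos + n].decode("utf-8")
            pos += n
        else:
            raise NotImplementedError(field["type"])
    return record, pos


def pack(schema, record):
    """Hypothetical storage layout: length-prefixed JSON schema, then the payload."""
    schema_json = json.dumps(schema, sort_keys=True).encode("utf-8")
    return encode_long(len(schema_json)) + schema_json + encode_record(schema, record)


def unpack(blob):
    """Recover (schema, record) from a blob produced by pack()."""
    n, pos = decode_long(blob, 0)
    schema = json.loads(blob[pos:pos + n].decode("utf-8"))
    record, _ = decode_record(schema, blob, pos + n)
    return schema, record


schema = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]}
user = {"id": 42, "name": "david"}

print(encode_record(schema, user))  # b'T\ndavid'
blob = pack(schema, user)
print(unpack(blob) == (schema, user))  # True
```

The seven-byte payload carries no field names or type tags, which is why
the schema has to travel with the data somehow.  Storing the full schema
per record costs space; a real system would more likely store a schema ID
or fingerprint and look the schema up.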

-- Philip

On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <[EMAIL PROTECTED]> wrote:
> I have a storage project considering adding Thrift or Avro for record
> packing, and I have a couple questions.
> Other than type-id and field-ids, Avro and Thrift's designs seem
> isomorphic. Is the binary format not including field-type-info something
> that's set in stone, or something that's open for feedback?
> I prefer the philosophy of Avro, namely to expect schemas to be available,
> use those schemas for compatibility mapping, and to support dynamic schema
> parsing in any supported language. In fact, being able to parse schemas
> dynamically in any language is the real draw of Avro for me. (personally I'd
> prefer if they were actually Avro IDL, instead of JSON, but I understand
> that might complicate implementing client stubs).
> However, the fact that data is not tagged with any type-information is
> unacceptably dangerous for my application. There will be mechanisms for
> mapping records to schemas, and schemas will be stored, but if a schema were
> ever lost or corrupted, I can't afford for the data to turn into total junk.
> Unless data is trivially small, encoding a field type wouldn't change the
> size of the encoding much, but would provide some 'sanity checking' in
> addition to being able to recover the raw data even if a schema was lost or the
> schema ID for a piece of data was corrupted.
> Since Avro is relatively new, I'm asking to find out if this is anathema to
> the entire concept of Avro, or something that was chosen, but
> might be reconsidered eventually.
> Going the thrift route for me will mean injecting a bit of the Avro
> philosophy into Thrift, namely, adding a Thrift IDL parser to the language I
> need, so I can save Thrift IDLs and then dynamically read them. However,
> doing this as a one-off for my language is different from having a supported
> mechanism for all client languages -- like in Avro.
>
>