Re: question about completely untagged data...
Bruce Mitchener 2010-11-29, 04:44
To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up
the schema from that ID. It doesn't store each schema separately / entirely
alongside the corresponding data records / entries.
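As a rough illustration of that (schema ID, data) layout, here is a minimal Python sketch. It is not HAvroBase's actual code: the names `schemas`, `records`, `register_schema`, `put` and `get` are hypothetical, the ID is simply a truncated hash of the schema text, and the encoding is a toy stand-in for Avro's binary format. The point is only that each schema is stored once and each record carries just its ID.

```python
import hashlib
import json
import struct

# Hypothetical in-memory stand-ins for a schema table and a record table.
schemas = {}   # schema ID -> schema JSON text (each schema stored once)
records = []   # list of (schema ID, untagged payload bytes)


def register_schema(schema):
    """Store a schema once and return a short ID derived from its content."""
    text = json.dumps(schema, sort_keys=True)
    schema_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    schemas[schema_id] = text
    return schema_id


def encode(schema, record):
    """Pack values in schema field order, with no per-field type tags.

    Toy encoding for illustration only; this is NOT Avro's binary format.
    """
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "long":
            out += struct.pack(">q", value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw
        else:
            raise ValueError("unsupported type: " + field["type"])
    return bytes(out)


def decode(schema, payload):
    """Read values back; the schema is the only source of type information."""
    record, pos = {}, 0
    for field in schema["fields"]:
        if field["type"] == "long":
            (record[field["name"]],) = struct.unpack_from(">q", payload, pos)
            pos += 8
        elif field["type"] == "string":
            (length,) = struct.unpack_from(">I", payload, pos)
            pos += 4
            record[field["name"]] = payload[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unsupported type: " + field["type"])
    return record


def put(schema_id, record):
    """Append a (schema ID, data) tuple; the schema itself is not repeated."""
    schema = json.loads(schemas[schema_id])
    records.append((schema_id, encode(schema, record)))


def get(index):
    """Look the schema up by ID, then decode the payload with it."""
    schema_id, payload = records[index]
    return decode(json.loads(schemas[schema_id]), payload)
```

If the entry for an ID disappears from `schemas`, `get` fails with a `KeyError` and the payload bytes can no longer be interpreted, which is the failure mode David raises below.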
HAvroBase is really pretty nice and has backends for storing data into
things other than HBase...
On Mon, Nov 29, 2010 at 11:09 AM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:
> Hi David,
> Your assessment of Thrift and Avro being isomorphic is correct, and
> you've correctly identified the major philosophical difference. (It's
> in fact a little bit deeper than you suggest: at read time, there are
> always two schemas available: the reader's schema and the original
> schema that the data was written with.)
> Where are you storing the Avro records? Avro's binary format for
> records is unlikely to change: it's pretty stable and changing would
> be a big deal. On the other hand, Avro already has multiple ways for
> passing schema information along. Avro's RPC implementations do one
> thing. Avro data files store the schema in the header. You could, in
> your system, always store (schema, data) tuples. That's what Sam is
> doing in HAvroBase.
> -- Philip
> On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <[EMAIL PROTECTED]> wrote:
> > I have a storage project considering adding Thrift or Avro for record
> > packing, and I have a couple questions.
> > Other than type-id and field-ids, Avro and Thrift's designs seem
> > isomorphic. Is the binary format not including field-type-info something
> > that's set in stone, or something that's open for feedback?
> > I prefer the philosophy of Avro, namely to expect schemas to be
> > use those schemas for compatibility mapping, and to support dynamic
> > parsing in any supported language. In fact, being able to parse schemas
> > dynamically in any language is the real draw of Avro for me. (personally I'd
> > prefer if they were actually Avro IDL, instead of JSON, but I understand
> > that might complicate implementing client stubs).
> > However, the fact that data is not tagged with any type-information is
> > unacceptably dangerous for my application. There will be mechanisms for
> > mapping records to schemas, and schemas will be stored, but if a schema is
> > ever lost or corrupted, I can't afford for the data to turn into total
> > Unless data is trivially small, encoding a field type wouldn't change the
> > size of the encoding much, but would provide some 'sanity checking' in
> > addition to making it possible to recover the raw data even if a schema
> > was lost or the schema ID for a piece of data was corrupted.
> > Since Avro is relatively new, I'm asking to find out if this is anathema to
> > the entire concept of Avro, or something that was chosen, but
> > might be reconsidered eventually.
> > Going the thrift route for me will mean injecting a bit of the Avro
> > philosophy into Thrift, namely, adding a Thrift IDL parser to the language I
> > need, so I can save Thrift IDLs and then dynamically read them. However,
> > doing this as a one-off for my language is different from having a supported
> > mechanism for all client languages -- like in Avro.