Avro, mail # user - question about completely untagged data...
Re: question about completely untagged data...
David Jeske 2010-11-29, 04:40
On Sun, Nov 28, 2010 at 8:09 PM, Philip Zeyliger <[EMAIL PROTECTED]> wrote:

> Where are you storing the Avro records?
This is part of a database/storage project. To avoid the overhead of a
schema per record, I can store a schema ID per record and keep a directory
of schemas in the system. However, if the place where the schema is stored
gets botched (the ID gets corrupted, the schema file gets corrupted or
lost, etc.), the records become completely unintelligible. That sounds
like a scary prospect.
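(A minimal sketch of the schema-ID-per-record layout described above, using only the Python standard library. JSON stands in for Avro's binary encoding, the 8-byte SHA-256 prefix stands in for a real schema fingerprint, and all names here are invented for illustration.)

```python
import hashlib
import json
import struct

# The "directory of schemas": fingerprint -> schema. If this mapping is
# lost or corrupted, the records below become unintelligible -- the
# failure mode the email is worried about.
schema_directory = {}

def register_schema(schema: dict) -> bytes:
    """Fingerprint a schema and file it in the directory."""
    canonical = json.dumps(schema, sort_keys=True).encode()
    fp = hashlib.sha256(canonical).digest()[:8]  # 8-byte schema ID (assumption)
    schema_directory[fp] = schema
    return fp

def encode_record(schema_id: bytes, payload: dict) -> bytes:
    """Prefix each record with an 8-byte schema ID instead of a full schema."""
    body = json.dumps(payload).encode()
    return schema_id + struct.pack(">I", len(body)) + body

def decode_record(buf: bytes) -> tuple[dict, dict]:
    """Resolve the schema by ID; a missing directory entry is fatal."""
    schema_id = buf[:8]
    (length,) = struct.unpack(">I", buf[8:12])
    schema = schema_directory[schema_id]  # KeyError == unintelligible record
    return schema, json.loads(buf[12:12 + length].decode())

sid = register_schema({"type": "record", "name": "User",
                       "fields": [{"name": "id", "type": "long"}]})
schema, rec = decode_record(encode_record(sid, {"id": 42}))
```

The per-record cost here is a fixed 12 bytes (ID plus length), independent of schema size, which is the appeal of the scheme; the fragility is that every record depends on the directory surviving.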
> You could, in your system, always store (schema, data) tuples.  That's what
> Sam is
> doing in HAvroBase
> (
> http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
> ).
>

This sounds fine where records are documents and record sizes are quite
big. However, in my application there will be too many records whose size
is comparable to (or smaller than) the schema size for this to be
practical. Without compression, storing a schema per record would double or
triple the data size. With compression, it's a lot of unnecessary extra
work decoding and re-encoding schemas.
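(The size argument can be made concrete with rough numbers. These figures are illustrative assumptions, not measurements from the thread.)

```python
# Back-of-the-envelope comparison of (schema, data) tuples vs. an 8-byte
# schema ID per record, when records are no bigger than their schema.
schema_size = 300   # bytes; a plausible small record schema (assumption)
record_size = 150   # bytes; a record smaller than its schema (assumption)
num_records = 1_000_000

tuples_total = num_records * (schema_size + record_size)  # (schema, data) per record
id_total = num_records * (8 + record_size)                # 8-byte schema ID per record
blowup = tuples_total / id_total
print(f"{blowup:.1f}x")  # roughly 2.8x with these numbers
```

With the schema larger than the record, the tuple layout lands squarely in the "double or triple the data size" range the email describes.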

I suppose I could use Avro's API to dump a "dense binary type-only schema"
that wouldn't contain the names of types, only a packed format of the types
themselves. This would essentially be the same as Thrift, except that the
types would be packed at the beginning (or end) of the record instead of
interspersed with it. In the common case Avro would be handed the "real"
schemas anyway (old and new), so it wouldn't even look at this; it would
just be there for safety's sake in case we needed to do some disaster
recovery.
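(One way to read "dense binary type-only schema" is a reduction of an Avro-style schema to just its type structure, dropping field and record names. A hypothetical sketch; the single-letter codes are invented for illustration and are not part of Avro.)

```python
# Reduce an Avro-style JSON schema to a compact, name-free type signature
# that could be packed alongside the data for disaster recovery.
TYPE_CODES = {"null": "n", "boolean": "b", "int": "i", "long": "l",
              "float": "f", "double": "d", "bytes": "y", "string": "s"}

def dense_types(schema) -> str:
    if isinstance(schema, str):            # primitive type name
        return TYPE_CODES[schema]
    if isinstance(schema, list):           # union, e.g. ["null", "string"]
        return "U(" + "".join(dense_types(s) for s in schema) + ")"
    t = schema["type"]
    if t == "record":                      # field names dropped, types kept
        return "R(" + "".join(dense_types(f["type"]) for f in schema["fields"]) + ")"
    if t == "array":
        return "A(" + dense_types(schema["items"]) + ")"
    if t == "map":
        return "M(" + dense_types(schema["values"]) + ")"
    return TYPE_CODES[t]

user = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"]},
    {"name": "tags", "type": {"type": "array", "items": "string"}}]}
print(dense_types(user))  # R(lU(ns)A(s))
```

A signature like this can't restore field names, but it preserves enough structure to decode the byte stream back into typed values if the real schema directory is lost.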