Avro, mail # user - Schema evolution and Specific vs Generic


Re: Schema evolution and Specific vs Generic
Martin Kleppmann 2013-12-04, 23:56
Hi Arvind,

The choice of specific vs. generic records should be orthogonal to schema
evolution — i.e. from a schema evolution point of view it doesn't matter
whether you're using the specific or the generic API. The same applies to
storage space — the binary encoding is exactly the same, whichever API you
use. I don't know about the performance difference; you're probably best
off benchmarking your use case for yourself if it's critical.
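
To make the storage point concrete, here is a minimal Python sketch of how Avro's binary encoding lays out a simple record. It is a toy re-implementation for illustration, not the Avro library: the helpers `encode_long`, `encode_string` and `encode_user`, and the `User` record with fields `id` and `name`, are all made up for this example. The bytes depend only on the schema and the values, not on whether the values came from a generated class or from a generic map-like record.

```python
from dataclasses import dataclass


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag, then base-128 varint, low-order group first."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(user_id: int, name: str) -> bytes:
    """A record is just its fields, concatenated in schema order."""
    return encode_long(user_id) + encode_string(name)


@dataclass
class User:  # hypothetical stand-in for a generated specific-record class
    id: int
    name: str


specific = User(id=42, name="Arvind")
generic = {"id": 42, "name": "Arvind"}  # hypothetical stand-in for a generic record

specific_bytes = encode_user(specific.id, specific.name)
generic_bytes = encode_user(generic["id"], generic["name"])

print(specific_bytes.hex())  # 540c417276696e64
print(specific_bytes == generic_bytes)  # True
```

Note that no field names or type tags appear in the output, which is why the reader needs the writer's schema to make sense of the bytes.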

The question of when to re-generate specific records from the schemas
depends on what restrictions you impose on schema evolution. For example,
if the writer adds a field to a record, but the reader doesn't need that
field, the reader can happily continue using code generated from the old
schema, and the new field will be ignored. But if you want to add a symbol
to an enum, or a branch to a union type, the reader needs to be on the new
schema before it receives data written using the new enum symbol/union
branch, otherwise you'll get a runtime exception on decoding.
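
The two cases above can be sketched with a toy model of what the reader does when it resolves the writer's schema against its own. This is plain Python, not the Avro library, and every name in it (`resolve_record`, `resolve_enum`, the `SchemaMismatch` exception, the field and symbol names) is hypothetical. It also assumes the reader's enum has no default symbol to fall back on.

```python
class SchemaMismatch(Exception):
    """Hypothetical stand-in for the decoder's runtime exception."""


def resolve_record(writer_fields, reader_fields, values):
    """Keep the fields the reader knows about; skip writer-only fields.

    `values` holds the decoded values in the writer's field order.
    """
    decoded = dict(zip(writer_fields, values))
    return {name: decoded[name] for name in reader_fields if name in decoded}


def resolve_enum(writer_symbols, reader_symbols, index):
    """An enum value is encoded as an index into the WRITER's symbol list."""
    symbol = writer_symbols[index]
    if symbol not in reader_symbols:
        raise SchemaMismatch(f"reader schema has no enum symbol {symbol!r}")
    return symbol


# Case 1: the writer added a field; the reader still uses the old schema.
old_fields = ["id", "name"]
new_fields = ["id", "name", "email"]
record = resolve_record(new_fields, old_fields, [42, "Arvind", "a@example.com"])
print(record)  # {'id': 42, 'name': 'Arvind'}

# Case 2: the writer added an enum symbol; the reader still uses the old schema.
old_symbols = ["ACTIVE", "INACTIVE"]
new_symbols = ["ACTIVE", "INACTIVE", "SUSPENDED"]
print(resolve_enum(new_symbols, old_symbols, 0))  # ACTIVE
try:
    resolve_enum(new_symbols, old_symbols, 2)
except SchemaMismatch as exc:
    print("decode failed:", exc)
```

A new union branch behaves like the enum case: the data carries the index of the writer's branch, and a reader with no matching branch has nothing to decode it into.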

The rules for what is and what isn't compatible in schema evolution make
most sense if you consider how the data is actually encoded. This post of
mine attempts to explain it:
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Hope that helps,
Martin

On 4 December 2013 06:30, Arvind Kalyan <[EMAIL PROTECTED]> wrote:

> Hi folks, a high level question.
>
> Say we have readers and writers in different projects. The writer project
> dumps some data in some directory (or stores in a common store, etc) and
> the reader project picks up that data and uses its reader schema and the
> published writer schema (say we have a way to ship writer schemas along
> with the dataset).
>
> In that kind of setup where reader and writer schemas change at their own
> rate, and are their own projects, and they are going to ship data over the
> wire, how do you compare using SpecificRecords vs GenericRecords?
>
> 1. At what point would the reader project be forced to re-generate their
> Specific records from schemas? Every time the writer schema changes in any way?
> Every time a new field is added in the writer schema? When schema evolution
> support is critical and we have multiple projects writing and reading data
> over the wire, is the static typing provided by SpecificRecord going to be
> a bottleneck or is that not going to be a concern regardless of Generic or
> Specific Record?
>
> 2. In terms of efficiency and performance, have you noticed one performing
> better than the other in terms of serialized/deserialized storage space and
> cpu utilization?
>
> We are interested in using Specific records because it offers static
> compile time checks and ensures we are writing code to the correct field
> names and datatypes and such but would like to hear from the community what
> your thoughts are on this.
>
> Thanks!
>
> --
> Arvind Kalyan
>
>