Avro >> mail # user >> Schema evolution and projection

Re: Schema evolution and projection
> There doesn't seem to be much information available on how to perform
> these tasks. The examples on the C API page confusingly mix the old
> datum API with the new value API.

Apologies for that — you're absolutely right that we need to clean up
the C API documentation a bit.

> Is this how schema projection is supposed to work? Does it just return
> items of the same type irrespective of the field name specified?

tl;dr — The schema projection doesn't happen for free; you need to use a
"resolved writer" to perform the schema resolution.

In the C API, when you open an Avro file for reading, we expect that the
avro_value_t that you pass in to avro_file_reader_read_value has the
*exact same* schema that was used to create the file.  So in your first
example (gist 5056626), your read_archive_test function works great
since it's explicitly asking the file for the writer schema, and using
that to create the value instance to read into.  If you know that you
want to read exactly what's in the file, not perform any schema
resolution, and (optionally) dynamically interrogate the writer schema
to see what fields are available, this is exactly the right approach.

On the other hand, if you want to use schema resolution to project away
some of the fields (or to do other interesting data conversions), you
need to create a resolved writer to perform that schema resolution.  The
resolved writer is an avro_value_iface_t that wraps up the schema
resolution rules for a particular writer schema and reader schema.  When
you create an avro_value_t instance of the resolved writer, it looks
like it's an instance of the writer schema, and it wraps an instance of
the reader schema.  Since the resolved writer value is an instance of
the writer schema, you can read data into it using
avro_file_reader_read_value.  Under the covers, it will perform the
schema resolution and fill in the wrapped reader schema instance.  You
can then read the projected data out of your reader value.

That's probably still too dense an explanation in plain English; I'll
whip together an example program and post it as a gist so that you can
see it in actual code.

(As an aside, the reason your original projection_test worked the way that it
did is that a single "record { int, int }" value happens to have the
same serialization as two consecutive "int" values.
avro_file_reader_read_value doesn't do any schema resolution; it just
tries to read a value of the type that you pass in.)
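To see why the bytes line up, here is a small standalone sketch that does not use the Avro library at all. It hand-rolls the int encoding from the Avro specification (zigzag, then a variable-length base-128 integer); the function names are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the Avro binary encoding of an int to buf and returns the number
 * of bytes written (at most 5). */
size_t encode_avro_int(int32_t n, uint8_t *buf)
{
    /* Zigzag: map small negative and positive numbers to small unsigned ones. */
    uint32_t z = ((uint32_t)n << 1) ^ (n < 0 ? 0xFFFFFFFFu : 0u);
    size_t len = 0;
    /* Varint: 7 bits per byte, high bit set while more bytes follow. */
    while (z >= 0x80) {
        buf[len++] = (uint8_t)(z | 0x80);
        z >>= 7;
    }
    buf[len++] = (uint8_t)z;
    return len;
}

/* A record is encoded as its fields back to back: no header, no field
 * names, no separators. So "record { int, int }" is just two encoded ints. */
size_t encode_int_pair_record(int32_t a, int32_t b, uint8_t *buf)
{
    size_t len = encode_avro_int(a, buf);
    len += encode_avro_int(b, buf + len);
    return len;
}
```

Because the record adds no framing of its own, a reader that asks for a plain "int" consumes the record's first field, and the next read picks up the second field.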

cheers
–doug