I have what I imagine is a standard setup: a web application generates
data in MySQL, which I want to analyze in Hadoop; I run a nightly
process to extract the tables of interest, Avro-ize them, and dump them
into HDFS.
This has worked great so far because the tools I'm using make it easy to
load a directory tree of Avros with the same schema.
The issue is what to do when the schema changes in the SQL database. I
believe column additions and deletions are handled by Avro's schema
resolution (a dropped column is ignored, and an added column just needs
a default in the reader schema), but a rename is not automatic: to Avro
it looks like one field deleted and an unrelated field added.
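For reference, here's my understanding of the add/delete case. This is a minimal sketch with a hypothetical `signup_date` column added to a hypothetical `users` table after some files were already written; the `default` is what lets older files resolve against the newer reader schema:

```json
{
  "type": "record",
  "name": "users",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "signup_date",
     "type": ["null", "string"],
     "default": null}
  ]
}
```

Reading pre-rename files with this as the reader schema should just fill `signup_date` with null.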
My thinking is: I could bake the table schemas, as of ETL time, into the
Avros for the historical record, but also manually copy each schema out
as a "master" schema and apply it to all the Avros it covers; then when
a column rename occurs, I'd go back and edit the master schema.
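If I understand Avro's `aliases` property correctly, "editing the master schema" after a rename would mean adding the old SQL column name as an alias on the renamed field, so the master schema still resolves data written under the old name. A sketch with hypothetical names, where `username` was renamed to `user_name` in MySQL:

```json
{
  "type": "record",
  "name": "users",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "user_name",
     "type": "string",
     "aliases": ["username"],
     "doc": "renamed from username in the SQL schema"}
  ]
}
```

When this is supplied as the reader schema, files whose embedded writer schema still says `username` should come back under `user_name`.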
I've never used an external schema before, so please correct me if I
misunderstand how they work.
Anyone have wisdom to share on this topic? I'd love to hear from anyone
who has done this, or has a better solution.