Mason 2013-05-22, 17:34
-Re: ETL in face of column renames
Scott Carey 2013-05-23, 19:16
On 5/22/13 10:34 AM, "Mason" <[EMAIL PROTECTED]> wrote:
>I have what I imagine is a standard setup: a web application generates
>data in MySQL, which I want to analyze in Hadoop; I run a nightly
>process to extract tables of interest, Avroize, and dump into HDFS.
>This has worked great so far because the tools I'm using make it easy to
>load a directory tree of Avros with the same schema.
>The issue is what to do when schema changes occur in the SQL database. I
>believe column additions and deletions are handled automatically by the
>Avro loaders I'm using, but I need to deal with a column rename.
>My thinking is: I could bake the table schemas at time of ETL into the
>Avros, for historical record, but then manually copy that schema out as
>a "master" schema and apply it to all Avros for which it's appropriate;
>then when a column rename occurs, go back and edit the master schema.
>I've never used an external schema before, so please correct if I
>misunderstand how they work.
>Anyone have wisdom to share on this topic? I'd love to hear from anyone
>who has done this, or has a better solution.
The first thing that comes to mind is the alias feature for field names:
If you bare using Avro data files, these contain the schemas at the time
of writing for "historical record".
The trick is being able to distinguish between someone who renamed a
column from "foo" to "fubar" and a case where "foo" was removed and
"foobar" added. To do this, one has to have knowledge from the SQL
database DDL changes.
Once you have this, you can choose your reader schema appropriately --
likely by using the 'latest' schema decorated with field aliases where
appropriate, but there are other options.