When ORC moved out of Hive, it didn’t bring the ACID work along. I’d like
to start working to remedy that. I wanted to give an outline of how I am
thinking of approaching it.

In general, I plan to focus on supporting the new split update (aka ACID
2.0) layout, where delta files contain either all inserts or all deletes
(updates are accomplished by putting in a delete and an insert). This is
what Hive supports in its trunk (but not in Hive 1 or 2).
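
To make the split update idea concrete, here is a minimal sketch of how an update decomposes into one delete event and one insert event, each destined for its own kind of delta. The class, the enum, and the single long row key are hypothetical and for illustration only; they are not Hive's or ORC's actual types.

```java
import java.util.Arrays;
import java.util.List;

public class SplitUpdateSketch {
    /** Which kind of delta an event belongs to; a delta never mixes the two. */
    enum DeltaKind { INSERT, DELETE }

    /** Hypothetical event: target delta kind, row key, and new value (null for deletes). */
    static final class Event {
        final DeltaKind kind;
        final long rowKey;
        final String value;

        Event(DeltaKind kind, long rowKey, String value) {
            this.kind = kind;
            this.rowKey = rowKey;
            this.value = value;
        }
    }

    /** Express an update as a delete of the old row version plus an insert of the new one. */
    static List<Event> update(long oldRowKey, long newRowKey, String newValue) {
        return Arrays.asList(
            new Event(DeltaKind.DELETE, oldRowKey, null),
            new Event(DeltaKind.INSERT, newRowKey, newValue));
    }

    public static void main(String[] args) {
        for (Event e : update(7L, 42L, "new value")) {
            System.out.println(e.kind + " delta: row " + e.rowKey
                + (e.value == null ? "" : " = " + e.value));
        }
    }
}
```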

I also plan to follow the ORC pattern of focusing on vectorized row batches
first, and then building the row-by-row readers and writers as shims on top.

1) Build a version of RecordReader that can handle ACID files. This would
be roughly analogous to Hive’s VectorizedOrcAcidRowBatchReader.
2) I haven’t looked into the details here yet, but I assume the Writer side
will also need changes to handle writing base versus delta files and insert
versus delete delta files.
3) Put the shims in place to support ORC equivalents to Hive’s
AcidInputFormat and AcidOutputFormat.
4) Change Hive to use the code now in ORC rather than duplicating this code.
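
As a rough illustration of what the reader in step 1 has to do, the sketch below hides rows that a delete delta has marked as deleted by computing the indexes of the surviving rows in a batch. It uses plain arrays and a set of row keys rather than real ORC or Hive classes; the class name, the method name, and the single long key per row are assumptions for illustration, not the design of VectorizedOrcAcidRowBatchReader.

```java
import java.util.Arrays;
import java.util.Set;

public class DeleteFilterSketch {
    /**
     * Returns the indexes of the rows in a batch that are still live, that is,
     * whose key does not appear in the set of keys read from delete deltas.
     * Only the first {@code size} entries of {@code rowKeys} are examined.
     */
    static int[] selectLive(long[] rowKeys, int size, Set<Long> deletedKeys) {
        int[] selected = new int[size];
        int live = 0;
        for (int i = 0; i < size; i++) {
            if (!deletedKeys.contains(rowKeys[i])) {
                selected[live++] = i;   // keep this row's position in the batch
            }
        }
        return Arrays.copyOf(selected, live);
    }
}
```

Returning a list of selected indexes, rather than copying the surviving rows, keeps the batch's column data untouched, which fits the vectorized-first approach described above.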

Should I do this in master or in a branch? In general I prefer to work in
master when possible. But I see a couple of reasons to branch:

1) This will require changes in Hive, some of which aren’t released yet. For
example, this will depend on moving ValidTxnList to storage-api (which I
plan to do anyway, but haven’t yet). It would be convenient to be able to
depend on SNAPSHOT versions of storage-api rather than forcing a bunch of
releases. But I don’t want to do that in master because it can make it
hard for people to build and it makes releases impossible.
2) This is going to take a while and I suspect ORC will want to release
multiple times before it’s done. I’m not sure we want to have half-baked
features in the releases.