I disagree with number 4.  I work on ACID a lot, and VectorizedOrcAcidRowBatchReader and
OrcRawRecordMerger (as well as the write path) are modified quite often.  Moving this logic
into a separate project will add the large burden of having to release the ORC project to make
progress on Hive features.  

Instead, if Hive is refactored to have a pluggable ACID reader, with Hive containing an
implementation (exactly as it does now), then Hive ACID’s dependency on ORC is not increased.  
ORC can create its own implementation so that it can be used by projects using ORC directly, without Hive.
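
To make the pluggable idea concrete, here is a rough sketch in Java.  Every name in it
(AcidReader, HiveAcidReader, AcidReaderRegistry) is made up for illustration and is not an
existing Hive or ORC API, and the row handling is reduced to filtering deleted row ids out
of inserted ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical plug-in point; not an existing Hive or ORC interface.
interface AcidReader {
    // Returns the inserted row ids that are not cancelled by a delete event.
    List<String> readVisibleRows(List<String> insertedRowIds, List<String> deletedRowIds);
}

// Stand-in for the implementation that would stay in Hive (assumed name).
class HiveAcidReader implements AcidReader {
    @Override
    public List<String> readVisibleRows(List<String> insertedRowIds, List<String> deletedRowIds) {
        Set<String> deleted = new HashSet<>(deletedRowIds);
        List<String> visible = new ArrayList<>();
        for (String id : insertedRowIds) {
            if (!deleted.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }
}

// Hypothetical registry through which a caller picks an implementation by name.
class AcidReaderRegistry {
    private final Map<String, Supplier<AcidReader>> factories = new HashMap<>();

    void register(String name, Supplier<AcidReader> factory) {
        factories.put(name, factory);
    }

    AcidReader create(String name) {
        Supplier<AcidReader> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No ACID reader registered as: " + name);
        }
        return factory.get();
    }
}
```

With something along these lines, Hive would register and keep evolving its own reader
without waiting on an ORC release, and ORC could register a separate implementation for
projects that use ORC directly.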


On 9/13/17, 11:34 AM, "Alan Gates" <[EMAIL PROTECTED]> wrote:

    When ORC moved out of Hive, it didn’t bring the ACID work along.  I’d like
    to start working to remedy that.  I wanted to give an outline of how I am
    thinking of approaching it.
    In general, I plan to focus on supporting the new split update (aka ACID
    2.0) layout, where delta files contain either all inserts or all deletes
    (updates are accomplished by putting in a delete and an insert).  This is
    what Hive supports in its trunk (but not in Hive 1 or 2).
    I also plan to follow the ORC pattern of focusing on vectorized row batches
    first, and then building the row by row readers and writers as shims on top
    of this.
    Proposed plan:
    1) Build a version of RecordReader that can handle ACID files.  This would
    be roughly analogous to Hive’s VectorizedOrcAcidRowBatchReader.
    2) I haven’t looked into the details here yet, but I assume I will need
    some changes on the Writer side as well to handle writing out base versus
    delta files as well as insert versus delete delta files.
    3) Put the shims in place to support ORC equivalents to Hive’s
    AcidInputFormat and AcidOutputFormat.
    4) Change Hive to use the code now in ORC rather than duplicating this code
    in Hive.
    Seem reasonable?
    Should I do this in master or in a branch?  In general I prefer to work in
    master when possible.  But I see a couple of reasons to branch:
    1) This will require changes in Hive, some that aren’t released yet.  For
    example, this will depend on moving ValidTxnList to storage-api (which I
    plan to do anyway, but haven’t yet).  It would be convenient to be able to
    depend on SNAPSHOT versions of storage-api rather than forcing a bunch of
    releases.  But I don’t want to do that in master because it can make it
    hard for people to build and it makes releases impossible.
    2) This is going to take a while and I suspect ORC will want to release
    multiple times before it’s done.  I’m not sure we want to have half-baked
    features in the releases.