Drill, mail # dev - Quick overview of HyperBatch concept.

Quick overview of HyperBatch concept.
Jacques Nadeau 2013-08-07, 01:53
Someone was asking me about the HyperBatch concept that a recent
commit introduced.  The idea is pretty simple.  We currently have a
two-byte selection vector that we can use to mask a portion of a
columnar record batch before we rewrite it.  This is to help in
situations where the rewrite would be unwarranted given the subsequent
operator.  This works great for non-blocking operators.
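
To make the masking idea concrete, here is a rough Java sketch (not the
actual Drill code). The class and method names are made up for
illustration, and a plain int[] stands in for a column of a record
batch.  Java's char is used as the two-byte unsigned index.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical illustration of a two-byte selection vector; not Drill's API.
final class Sv2Sketch {
    // A two-byte index can address at most 65,536 records in one batch.
    static final int MAX_RECORDS = 1 << 16;

    // Record which positions survive the filter instead of copying the column.
    static char[] select(int[] column, IntPredicate keep) {
        if (column.length > MAX_RECORDS) {
            throw new IllegalArgumentException("batch too large for a two-byte index");
        }
        char[] selected = new char[column.length];
        int count = 0;
        for (int i = 0; i < column.length; i++) {
            if (keep.test(column[i])) {
                selected[count++] = (char) i; // char is an unsigned 16-bit value
            }
        }
        return Arrays.copyOf(selected, count);
    }

    // A downstream operator reads through the selection vector; the
    // underlying column is never rewritten.
    static long sumSelected(int[] column, char[] sv2) {
        long total = 0;
        for (char index : sv2) {
            total += column[index];
        }
        return total;
    }
}
```

For example, filtering the column {5, 12, 7, 30, 1} with "value > 6"
yields the selection {1, 2, 3}, and the next operator only ever touches
those three positions.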

In the case of blocking operators such as sort, this becomes a bit
harder.  (Especially in the case of schema changes, which I won't
discuss here.)  One solution is to generate a new thing called a
hyperbatch.  It looks kind of like a batch but it carries a
SelectionVector4 with it.  The SV4 describes not only the valid
records, but also their location within a set of multiple support
record batches.  This is encoded as two unsigned bytes for the record
batch index followed by two unsigned bytes for the individual record
(4B records max).  In these cases, a (hyper)batch doesn't hold a
ValueVector for each field but rather an indexed array of
ValueVectors.  This allows a pointer sort to be completed without
rewriting the columnar oriented data until required (typically when
writing to disk or socket).  In the meantime, some additional
operators can be pipelined with only small modifications.  If we get
to the point that a particular operator no longer supports an SV4 input
batch, we insert a SelectionVectorRemover to rewrite the data to the
more standard record batch format.
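
The encoding and the pointer sort described above can be sketched
roughly as follows.  This is only an illustration under my own
assumptions: Sv4Sketch and its methods are hypothetical names, an
int[][] stands in for the indexed array of ValueVectors for one field,
and materialize() only mimics what a SelectionVectorRemover would do.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of a four-byte selection vector; not Drill's API.
final class Sv4Sketch {
    // Upper two bytes: record batch index. Lower two bytes: record index.
    static int encode(int batchIndex, int recordIndex) {
        if (batchIndex < 0 || batchIndex > 0xFFFF || recordIndex < 0 || recordIndex > 0xFFFF) {
            throw new IllegalArgumentException("index does not fit in two unsigned bytes");
        }
        return (batchIndex << 16) | recordIndex;
    }

    static int batchOf(int entry) {
        return entry >>> 16; // unsigned shift, so batch 65535 decodes correctly
    }

    static int recordOf(int entry) {
        return entry & 0xFFFF;
    }

    // Pointer sort: only the four-byte entries move, the batches stay as they are.
    static int[] pointerSort(int[][] batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        Integer[] entries = new Integer[total];
        int n = 0;
        for (int b = 0; b < batches.length; b++) {
            for (int r = 0; r < batches[b].length; r++) {
                entries[n++] = encode(b, r);
            }
        }
        Arrays.sort(entries, Comparator.comparingInt(e -> batches[batchOf(e)][recordOf(e)]));
        return Arrays.stream(entries).mapToInt(Integer::intValue).toArray();
    }

    // Rewrite into one plain column, deferred until an operator needs it.
    static int[] materialize(int[][] batches, int[] sv4) {
        int[] out = new int[sv4.length];
        for (int i = 0; i < sv4.length; i++) {
            out[i] = batches[batchOf(sv4[i])][recordOf(sv4[i])];
        }
        return out;
    }
}
```

With the batches {30, 10} and {20, 5}, the sorted vector points at
(batch 1, record 1), (0, 1), (1, 0), (0, 0), and materializing it gives
5, 10, 20, 30 while both input batches are left untouched.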

You can see an example of the interaction at line 68 of this file: