Drill >> mail # dev >> Quick overview of HyperBatch concept.

Re: Quick overview of HyperBatch concept.
Selection vector is the same.   Not sure whether either of the others
embraces hyperbatches or whether they are new for Drill.

On Aug 6, 2013 7:27 PM, "Timothy Chen" <[EMAIL PROTECTED]> wrote:

> Ah gotcha, it's the same concept in MonetDB and what the Hive batch query
> engine is using too. Didn't know they call it HyperBatch (unless you
> invented it?)
> Tim
> On Tue, Aug 6, 2013 at 6:53 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:
> > Someone was asking me about the HyperBatch concept that a recent
> > commit introduced.  The idea is pretty simple.  We currently have a
> > two byte selection vector that we can use to mask a portion of a
> > columnar record batch before we rewrite it.  This is to help in
> > situations where the rewrite would be unwarranted given the subsequent
> > operator.  This works great for non-blocking operators.
> >
> > In the case of blocking operators such as sort, this becomes a bit
> > harder.  (Especially in the case of schema changes, which I won't
> > discuss here.)  One solution is generating a new thing called a
> > hyperbatch.  It looks kind of like a batch but it carries a
> > SelectionVector4 with it.  The SV4 describes not only the valid
> > records, but also their location within a set of multiple support
> > record batches.  This is encoded as two unsigned bytes for the record
> > batch index followed by two unsigned bytes for the individual record
> > (4B records max).  In these cases, a (hyper)batch doesn't hold a
> > ValueVector for each field but rather an indexed array of
> > ValueVectors.  This allows a pointer sort to be completed without
> > rewriting the columnar oriented data until required (typically when
> > writing to disk or socket).  In the meantime, some additional
> > operators can be pipelined with only small modifications.  If we get
> > to the point that a particular operator no longer supports a SV4 input
> > batch, we insert a SelectionVectorRemover to rewrite the data to the
> > more standard record batch format.
> >
> > You can see an example of the interaction at line 68 of this file:
> >
> >
> > https://github.com/apache/incubator-drill/blob/db3afaa854fc8475592907dba97162ecf869f9df/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/CodeGenerator.java
> >
> >
> > thanks,
> > Jacques
> >
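
For readers following along, here is a rough sketch of the SV4 encoding and pointer sort described in the quoted message. It is not Drill's actual code: the class name Sv4Sketch and the methods encode, batchIndex, recordIndex, pointerSort and materialize are all made up for illustration, and plain int arrays stand in for ValueVectors. It assumes the batch index sits in the upper two bytes of a 32-bit entry and the record index in the lower two, which matches the ordering in the description but has not been checked against the commit.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration only; not Drill's SelectionVector4. */
final class Sv4Sketch {

    /** Pack a batch index and a record index (each 0..65535) into one 4-byte entry. */
    static int encode(int batchIndex, int recordIndex) {
        if (batchIndex < 0 || batchIndex > 0xFFFF
                || recordIndex < 0 || recordIndex > 0xFFFF) {
            throw new IllegalArgumentException("index does not fit in two unsigned bytes");
        }
        // Assumed layout: upper 16 bits = batch, lower 16 bits = record.
        return (batchIndex << 16) | recordIndex;
    }

    /** Upper two bytes, read as unsigned. */
    static int batchIndex(int entry) {
        return entry >>> 16;
    }

    /** Lower two bytes, read as unsigned. */
    static int recordIndex(int entry) {
        return entry & 0xFFFF;
    }

    /**
     * Pointer sort: build one entry per record across all batches, then sort
     * the entries by the value they point at. The column data is never moved.
     */
    static int[] pointerSort(int[][] batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        Integer[] entries = new Integer[total];
        int n = 0;
        for (int b = 0; b < batches.length; b++) {
            for (int r = 0; r < batches[b].length; r++) {
                entries[n++] = encode(b, r);
            }
        }
        Arrays.sort(entries,
                Comparator.comparingInt(e -> batches[batchIndex(e)][recordIndex(e)]));
        int[] sv4 = new int[total];
        for (int i = 0; i < total; i++) {
            sv4[i] = entries[i];
        }
        return sv4;
    }

    /**
     * Rewrite the data into one contiguous array in SV4 order, roughly the
     * job the message assigns to a SelectionVectorRemover.
     */
    static int[] materialize(int[][] batches, int[] sv4) {
        int[] out = new int[sv4.length];
        for (int i = 0; i < sv4.length; i++) {
            out[i] = batches[batchIndex(sv4[i])][recordIndex(sv4[i])];
        }
        return out;
    }
}
```

With two batches {30, 10} and {20, 40}, pointerSort returns four entries that point at 10, 20, 30 and 40 in turn, and only materialize copies any values. Two bytes each for batch and record gives 65536 x 65536 addressable records, which is where the "4B records max" figure comes from.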