Re: First pass at a reference interpreter

Cool stuff, Jacques - will give it a shot ASAP!

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 14 Jan 2013, at 15:56, Jacques Nadeau <[EMAIL PROTECTED]> wrote:

> I've been pulling together a reference logical plan interpreter.  I'm
> working with Ted to get it inside the Drill sandbox.  For now, you can find
> it on my repo at https://github.com/jacques-n/incubator-drill (prototype
> branch)
>
>
>
> The goals of the reference interpreter are:
>
>
>   - To provide a simple way to run a Logical Plan against some sample data
>   and get back the expected result
>   - To allow work to start on the parsers while we scale up the performance
>   and capabilities of the execution engine and optimizer.
>   - To allow evaluation work on particular technical approaches, such as
>   exploring the impact of hierarchical and schema-less data on query
>   evaluation.
>
> These goals do not include performance, memory handling, or efficiency.
> Currently, the interpreter is a single node/thread process.  This will
> change shortly so that it can also run as a clustered process.
>
> The entry point is inside the /sandbox/prototype/exec/ref module:
> org.apache.drill.exec.ref.ReferenceInterpreter.main().  The example program
> uses two resources, simple-plan.json and donuts.json, and writes its output
> to /opt/data/out.json.
>
>
> Some of the things that 'work':
>
>
>   - Read/write basic json.
>   - ROPs (reference operators): Filter, Transform, Group, Aggregate
>   (simple), Order, Union.
>   - Example aggregate and basic functions including sum, count, multiply,
>   add, compare, equals.
>
> Basic glossary/concepts (we'll get this on the wiki/javadocs):
>
>
>   - LOP: Logical Operator.  An implementation-agnostic data flow operator
>   used by the Logical Plan.
>   - ROP: Reference Operator.  A reference operator implementation that
>   pairs with a LOP.
>   - FunctionDefinition: A definition of a particular function.  Describes
>   a set of aliases, an allowable set of input arguments and an interface that
>   will attempt to determine output type.
>   - BasicEvaluator: An implementation of a particular non-aggregate
>   expression.  Receives a record pointer at creation time. Returns a
>   DataValue.
>   - AggregateEvaluator: An implementation of a particular aggregating
>   function.  Is provided a record pointer at creation time.  Expects regular
>   calls to addRecord() followed by a call to eval() which provides the
>   aggregate value.
>   - DataValue: A pointer to a particular data value.  Implementation
>   classes include things like ScalarLong, ScalarBytes, SimpleMapValue and
>   SimpleArrayValue.
>
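For anyone picking up evaluators, the AggregateEvaluator contract in the glossary above (record pointer at creation, repeated addRecord(), then eval()) might look roughly like the sketch below. This is only an illustration: RecordPointer, SumEvaluator, the field name "sales" and every signature here are assumptions, not the classes in the prototype branch, and a plain long stands in for DataValue.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types only mirror names from the glossary.
// They are NOT the actual Drill classes or signatures.
public class AggregateSketch {

    /** Mutable cursor onto the "current" record; the caller repoints it. */
    static final class RecordPointer {
        private Map<String, Long> current = new HashMap<>();
        void set(Map<String, Long> record) { current = record; }
        long getLong(String field) { return current.get(field); }
    }

    /** Aggregating function: told about each record, then asked for a result. */
    interface AggregateEvaluator {
        void addRecord(); // fold the record currently under the pointer
        long eval();      // aggregate value (a DataValue in the real design)
    }

    /** Sums one numeric field; the pointer is supplied at creation time. */
    static final class SumEvaluator implements AggregateEvaluator {
        private final RecordPointer pointer;
        private final String field;
        private long total;

        SumEvaluator(RecordPointer pointer, String field) {
            this.pointer = pointer;
            this.field = field;
        }

        @Override public void addRecord() { total += pointer.getLong(field); }
        @Override public long eval() { return total; }
    }

    /** Drives the evaluator the way an aggregating operator might. */
    public static long sum(List<Map<String, Long>> records, String field) {
        RecordPointer pointer = new RecordPointer();
        AggregateEvaluator evaluator = new SumEvaluator(pointer, field);
        for (Map<String, Long> record : records) {
            pointer.set(record);   // advance the shared pointer
            evaluator.addRecord(); // evaluator reads through the pointer
        }
        return evaluator.eval();
    }

    public static void main(String[] args) {
        List<Map<String, Long>> donuts = List.of(
                Map.of("sales", 35L), Map.of("sales", 145L), Map.of("sales", 20L));
        System.out.println(sum(donuts, "sales")); // prints 200
    }
}
```

The point of handing over the pointer once is that the evaluator never receives a record argument; it always reads whatever the pointer currently references.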
> The standard record iterator used between ROPs is based on the
> org.apache.drill.exec.ref.RecordIterator interface.  This is somewhat
> inspired by the AttributeSource concepts from within the Lucene project.
> (I'm planning to extend these concepts all the way to the individual
> DataValues.)
>
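To make the AttributeSource analogy concrete: in Lucene, a TokenStream consumer fetches its attribute objects once and re-reads them after each incrementToken() call. A record iterator in the same spirit could expose one shared record pointer that is updated in place on each advance. The sketch below assumes that shape; the message does not show the real RecordIterator interface, so the method names, ScanRop, FilterRop and the donut fields are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a pull-based iterator with a shared record pointer.
// Names and signatures are assumptions, not the actual Drill interface.
public class IteratorSketch {

    /** Reused holder for the current record. */
    static final class RecordPointer {
        Map<String, Object> current;
    }

    interface RecordIterator {
        RecordPointer getRecordPointer(); // fetched once by the consumer
        boolean next();                   // advance; false when exhausted
    }

    /** Leaf operator: walks an in-memory list of records. */
    static final class ScanRop implements RecordIterator {
        private final RecordPointer pointer = new RecordPointer();
        private final Iterator<Map<String, Object>> source;

        ScanRop(List<Map<String, Object>> records) { source = records.iterator(); }

        @Override public RecordPointer getRecordPointer() { return pointer; }

        @Override public boolean next() {
            if (!source.hasNext()) return false;
            pointer.current = source.next(); // update the pointer in place
            return true;
        }
    }

    /** Filter operator: skips upstream records that fail the condition. */
    static final class FilterRop implements RecordIterator {
        private final RecordIterator incoming;
        private final Predicate<RecordPointer> condition;

        FilterRop(RecordIterator incoming, Predicate<RecordPointer> condition) {
            this.incoming = incoming;
            this.condition = condition;
        }

        @Override public RecordPointer getRecordPointer() { return incoming.getRecordPointer(); }

        @Override public boolean next() {
            while (incoming.next()) {
                if (condition.test(incoming.getRecordPointer())) return true;
            }
            return false;
        }
    }

    /** Builds a small sample record. */
    public static Map<String, Object> donut(String name, long sales) {
        Map<String, Object> record = new HashMap<>();
        record.put("name", name);
        record.put("sales", sales);
        return record;
    }

    /** Runs scan -> filter and collects the names that pass. */
    public static List<String> namesOver(List<Map<String, Object>> records, long minSales) {
        RecordIterator filter = new FilterRop(new ScanRop(records),
                p -> (Long) p.current.get("sales") > minSales);
        RecordPointer out = filter.getRecordPointer(); // obtained once, re-read per record
        List<String> names = new ArrayList<>();
        while (filter.next()) names.add((String) out.current.get("name"));
        return names;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> donuts =
                List.of(donut("glazed", 35), donut("filled", 145), donut("cake", 20));
        System.out.println(namesOver(donuts, 30)); // prints [glazed, filled]
    }
}
```

Because FilterRop hands back its upstream pointer rather than copying records, chaining operators adds no per-record allocation in this sketch; whether the prototype makes the same trade-off is not stated in the message.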
>
>
> My next goals are to add tests, finish adding ROPs, add local and remote
> exchange nodes (parallelization), add a bunch of documentation and extract
> out the Execution plan as a separate intermediate representation.
>
>
>
> It needs a lot more evaluators to be a true reference interpreter (as well
> as the rest of the ROPs).  The existing ones can be utilized as prototypes.
> Anyone interested in ripping through a bunch of additional evaluators and
> associated FunctionDefinitions?