Drill, mail # dev - question about schema

Re: question about schema
David Alves 2013-04-22, 04:28
Thank you for the Sunday reply.
Your overview was pretty much what I was assuming in general (the info comes inside the Scan).
So in general, the SE's optimizer rules will push the relevant ops inside the scan op. What gets pushed is, of course, SE dependent, but the definitions themselves are SE agnostic (just a bunch of physical ops that the SE will interpret internally).
The Scan op that reaches the SE is itself a physical op, correct (and not the current logical op)?
We could even do something a bit simpler, like:


           internal: [{
           child: 1,
           select: [
               {expr: "CAST('fam1.qual1', int)", mode: "VECTOR"},
               {expr: "CAST('fam1.qual2', nvarchar)", mode: "VECTOR"}
               {mode: "VECTOR", type:"SQL_INT"},  // field 1
               {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
           {locations: ["hserver1.local"], table: "donuts", regionId:"1234"},
           {locations: ["hserver2.local"], table: "donuts", regionId:"5678"}
           output:[ // matches the output of the last internal op so we might even refer to it directly
               {mode: "VECTOR", type:"SQL_INT"},  // output field 1 is a value vector driven by expression 1.
               {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
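
To make the pushdown idea concrete, here is a minimal sketch in Python of a rule that folds a project op into the scan's "internal" list when the SE says it can handle it. The class names (Scan, Project), the function push_projects and the "supports" callback are all hypothetical and only illustrate the shape of the rewrite; none of this is Drill's actual optimizer API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class Project:
    id: int
    child: int                 # id of the op this project reads from
    select: Tuple[str, ...]    # expression strings, e.g. "CAST('fam1.qual1', int)"


@dataclass(frozen=True)
class Scan:
    id: int
    storage_engine: str
    table: str
    internal: Tuple[Project, ...] = ()   # ops pushed into the scan (hypothetical field)


Op = Union[Scan, Project]


def push_projects(plan: List[Op],
                  supports: Callable[[Scan, Project], bool]) -> List[Op]:
    """Fold each project that sits directly on a scan into that scan,
    if the (hypothetical) SE callback accepts it."""
    ops = {op.id: op for op in plan}
    redirect = {}  # removed project id -> id of the scan that absorbed it
    for op in plan:
        if isinstance(op, Project):
            child = ops.get(op.child)
            if isinstance(child, Scan) and supports(child, op):
                ops[child.id] = replace(child, internal=child.internal + (op,))
                redirect[op.id] = child.id
    result = []
    for op in plan:
        if op.id in redirect:
            continue  # this project now lives inside its scan
        current = ops[op.id]
        # Re-point consumers of a removed project at the scan that absorbed it.
        if isinstance(current, Project) and current.child in redirect:
            current = replace(current, child=redirect[current.child])
        result.append(current)
    return result
```

With a scan (id 1) and a project (id 2, child 1) as input, and a callback that accepts everything for "hbase", the result is a single scan whose internal list holds the project. If the callback rejects the project, the plan comes back unchanged.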

Final question: all of this will be "strongly typed", correct? I mean these will be properties of the Scan physical op and not arbitrary JSON that maps to an SE-specific *InputConfig?
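
As an illustration of what "strongly typed" could mean here, the sketch below parses a scan op into fixed, named fields and rejects any key it does not know, instead of passing an opaque blob through to the SE. It is written in Python for brevity; the names ScanOp, ReadEntry, OutputField and parse_scan are hypothetical, and the input is assumed to be strict JSON (quoted keys, no comments), unlike the looser notation used in the plans in this thread.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OutputField:
    mode: str
    type: str


@dataclass(frozen=True)
class ReadEntry:
    locations: List[str]
    table: str
    region_id: str


@dataclass(frozen=True)
class ScanOp:
    id: int
    storage_engine: str
    entries: List[ReadEntry]
    output: List[OutputField]


# Keys the typed scan op understands; anything else is an error.
ALLOWED_KEYS = {"@id", "pop", "storageengine", "entries", "output"}


def parse_scan(text: str) -> ScanOp:
    """Parse a scan op from strict JSON into typed fields."""
    raw = json.loads(text)
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown scan properties: {sorted(unknown)}")
    if raw["pop"] != "scan":
        raise ValueError(f"not a scan op: {raw['pop']!r}")
    return ScanOp(
        id=raw["@id"],
        storage_engine=raw["storageengine"],
        entries=[
            ReadEntry(e["locations"], e["table"], e["regionId"])
            for e in raw["entries"]
        ],
        # OutputField(**f) raises TypeError on unexpected field keys.
        output=[OutputField(**f) for f in raw["output"]],
    )
```

The alternative being asked about would keep an SE-specific sub-object as an untyped map and hand it to the engine as is; the typed version trades that flexibility for validation at plan-parse time.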


On Apr 21, 2013, at 9:56 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:

> Here is roughly what I'm thinking.
> Storage engines have three levels of configuration that they receive.
> 1. System-level configuration (such as picking the particular hbase
> cluster).  This is bound to StorageEngineConfig.
> 2. Scan-node level configuration to be applied across read entries for
> a particular scan entry.  I'm currently modeling this as ScanSettings
> locally.
> 3. Read-level settings for a particular portion of a scan.  (The
> subdivision for parallelization.)  (ReadEntry)
> Initially, a scan node will be used to describe an ambiguous scan
> where the output is unknown.  It is the project's responsibility to
> convert the scan output into the desired schema.  However, in the case
> that a particular SE supports some level of projection, that
> information would be pushed by the optimizer down to the scan node and
> the projection node will be removed (or modified to simplify the
> projection).  The storage engine should be able to receive the
> portions of the projection that it is responsible for via a TBD
> interface that then writes the information to the ScanSettings or the
> ReadEntry depending on where the Storage Engine wants it.
> Below I've given a rough example of what a pre-optimization and
> post-optimization physical plan might look like.  You can see how the
> output of the scan changed and the addition of the storage engine
> specific settings object.
>        {
>            @id:1,
>            pop:"scan",
>            storageengine:"hbase",
>            entries:[
>             {locations: ["hserver1.local"], table: "donuts", regionId:"1234"},
>             {locations: ["hserver2.local"], table: "donuts", regionId:"5678"}
>            ],
>            output: [
>                {mode: "VECTOR", type: "MAP"} //field 1
>            ]
>        },
>        {
>            @id:2,
>            child: 1,
>            pop:"project",
>            select: [
>                {expr: "CAST('fam1.qual1', int)", mode: "VECTOR"},
>                {expr: "CAST('fam1.qual2', nvarchar)", mode: "VECTOR"}
>            ],
>            output:[
>                {mode: "VECTOR", type:"SQL_INT"},  // field 1