Drill, mail # dev - question about schema


Re: question about schema
Jacques Nadeau 2013-04-22, 15:52
Try to flip the thinking about pushdown a little bit.  Using an
approach that encapsulates transformations in optimization rules helps
to simplify other pieces of the puzzle.  Optimization rules can expose
the types of relational patterns that they can transform (physical
operator specific) and then they internally manage the actual
transformation.  Basically, storage engines expose something akin to
getOptimizationRules() that returns an ordered list of optimization
rules.
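
To make the idea concrete, here is a minimal sketch of a storage engine
handing the optimizer an ordered list of rules. Everything in it is an
assumption for illustration: Rel, Scan, Filter, OptimizationRule,
StorageEngine and FilterIntoScanRule are hypothetical names, not Drill or
optiq classes, and a real planner would match patterns far more generally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PushdownRuleSketch {
    /** Hypothetical relational node; not a Drill or optiq type. */
    interface Rel { }

    static final class Scan implements Rel {
        final String table;
        final List<String> pushedOps; // operations the storage engine will run itself
        Scan(String table, List<String> pushedOps) {
            this.table = table;
            this.pushedOps = Collections.unmodifiableList(new ArrayList<>(pushedOps));
        }
    }

    static final class Filter implements Rel {
        final Rel input;
        final String condition;
        Filter(Rel input, String condition) {
            this.input = input;
            this.condition = condition;
        }
    }

    /** A rule exposes the pattern it can transform and owns the rewrite. */
    interface OptimizationRule {
        boolean matches(Rel rel);
        Rel transform(Rel rel);
    }

    /** Hypothetical hook, named after the getOptimizationRules() idea above. */
    interface StorageEngine {
        List<OptimizationRule> getOptimizationRules();
    }

    /** Folds a Filter sitting directly on a Scan into the Scan. */
    static final class FilterIntoScanRule implements OptimizationRule {
        @Override public boolean matches(Rel rel) {
            return rel instanceof Filter && ((Filter) rel).input instanceof Scan;
        }
        @Override public Rel transform(Rel rel) {
            Filter filter = (Filter) rel;
            Scan scan = (Scan) filter.input;
            List<String> ops = new ArrayList<>(scan.pushedOps);
            ops.add("filter(" + filter.condition + ")");
            return new Scan(scan.table, ops);
        }
    }

    /** Applies the engine's rules once each, in the order the engine returned them. */
    static Rel optimize(Rel rel, StorageEngine engine) {
        for (OptimizationRule rule : engine.getOptimizationRules()) {
            if (rule.matches(rel)) {
                rel = rule.transform(rel);
            }
        }
        return rel;
    }

    public static void main(String[] args) {
        StorageEngine hbase =
            () -> Collections.<OptimizationRule>singletonList(new FilterIntoScanRule());
        Rel plan = new Filter(new Scan("donuts", Collections.<String>emptyList()), "price > 2");
        Scan result = (Scan) optimize(plan, hbase);
        System.out.println(result.table + " " + result.pushedOps);
    }
}
```

The planner never needs to know what the engine can absorb; it only asks
for the rules and applies whichever ones match.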

You can see an example of how optiq handles this here:
https://github.com/julianhyde/optiq-splunk/blob/master/src/main/java/net/hydromatic/optiq/impl/splunk/SplunkPushDownRule.java

The rules are what understand the specifics of the transformation
(converting one or multiple relations into another).  They can
interact with the storage engine specialization/typed components of
the scan (e.g. what a column family is).

Initially, I liked the idea of Scan being the single top level concept
that contained Storage Engine specializations.  However, I think the
typing model is too weak and confusing.  We should probably just
specialize the Scan operator to have things like HBaseScan to simplify
later binding.  (In optiq this is called a TableAccessRel.)
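
A rough sketch of what such a specialization might look like, assuming a
hypothetical Scan base class and an HBaseScan subclass whose fields are
invented here for illustration (they are not the actual Drill operator
definitions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TypedScanSketch {
    /** Hypothetical base operator shared by every storage engine. */
    static abstract class Scan {
        final String storageEngine;
        Scan(String storageEngine) {
            this.storageEngine = storageEngine;
        }
    }

    /** HBase specialization: column families are typed fields, not opaque config. */
    static final class HBaseScan extends Scan {
        final String table;
        final List<String> columnFamilies;
        HBaseScan(String table, List<String> columnFamilies) {
            super("hbase");
            this.table = table;
            this.columnFamilies = Collections.unmodifiableList(new ArrayList<>(columnFamilies));
        }
    }

    /** An engine-specific rule can bind on the operator type and read typed fields. */
    static String describe(Scan scan) {
        if (scan instanceof HBaseScan) {
            HBaseScan hbase = (HBaseScan) scan;
            return "hbase:" + hbase.table + hbase.columnFamilies;
        }
        return scan.storageEngine + ":<opaque>";
    }

    public static void main(String[] args) {
        System.out.println(describe(new HBaseScan("donuts", Arrays.asList("fam1"))));
    }
}
```

With a dedicated subclass, a rule can match on the operator type directly
instead of inspecting a storage engine name and an untyped payload.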

All this being said, it is likely that there will be a core set of
optimization rules which could be mostly the same and cloned for
different storage engines (e.g. filtering and field-level project).
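
One way such a shared rule could be written once and reused, sketched
under the assumption of a hypothetical FieldProjectable capability that
each engine's scan type implements (none of these names come from Drill
or optiq):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SharedRuleSketch {
    /** Hypothetical capability: the engine can restrict a scan to named fields. */
    interface FieldProjectable<S> {
        List<String> availableFields();
        S withFields(List<String> fields);
    }

    /**
     * One rule body for every engine: keep only the requested fields that
     * the scan actually offers, and let the engine build the narrowed scan.
     */
    static <S extends FieldProjectable<S>> S pushProject(S scan, List<String> wanted) {
        List<String> kept = new ArrayList<>();
        for (String field : wanted) {
            if (scan.availableFields().contains(field)) {
                kept.add(field);
            }
        }
        return scan.withFields(kept);
    }

    /** Illustrative HBase-style scan; fields are "family.qualifier" strings. */
    static final class HBaseScan implements FieldProjectable<HBaseScan> {
        final List<String> qualifiers;
        HBaseScan(List<String> qualifiers) {
            this.qualifiers = Collections.unmodifiableList(new ArrayList<>(qualifiers));
        }
        @Override public List<String> availableFields() { return qualifiers; }
        @Override public HBaseScan withFields(List<String> fields) { return new HBaseScan(fields); }
    }

    /** Illustrative second engine, to show the same rule applies unchanged. */
    static final class FileScan implements FieldProjectable<FileScan> {
        final List<String> columns;
        FileScan(List<String> columns) {
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        }
        @Override public List<String> availableFields() { return columns; }
        @Override public FileScan withFields(List<String> fields) { return new FileScan(fields); }
    }

    public static void main(String[] args) {
        HBaseScan hbase = new HBaseScan(Arrays.asList("fam1.qual1", "fam1.qual2", "fam2.qual1"));
        FileScan file = new FileScan(Arrays.asList("name", "price"));
        System.out.println(pushProject(hbase, Arrays.asList("fam1.qual1", "fam1.qual2")).qualifiers);
        System.out.println(pushProject(file, Arrays.asList("price")).columns);
    }
}
```

Filtering could follow the same shape with a second capability interface,
leaving only engine-specific details in each scan class.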

J

On Sun, Apr 21, 2013 at 9:28 PM, David Alves <[EMAIL PROTECTED]> wrote:
> Thank you for the Sunday reply.
> Your overview was pretty much what I was assuming in general (the info comes inside the Scan).
> So in general the SE's optimizer rules will push the relevant ops inside the scan op. What gets pushed is, of course, SE-dependent, but the definitions themselves are SE-agnostic (just a bunch of physical ops that the SE will interpret internally).
> The Scan op that reaches the SE itself is a physical op, correct (and not the current logical op)?
> We could even do something a bit simpler, like:
>
> ****POST-OPTIMIZATION****
>
>        {
>            @id:1,
>            pop:"scan",
>            storageengine:"hbase",
>            internal: [{
>                @id:2,
>                child: 1,
>                pop:"project",
>                select: [
>                    {expr: "CAST('fam1.qual1', int)", mode: "VECTOR"},
>                    {expr: "CAST('fam1.qual2', nvarchar)", mode: "VECTOR"}
>                ],
>                output:[
>                    {mode: "VECTOR", type:"SQL_INT"},  // field 1
>                    {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
>                ]
>            }],
>            entries:[
>                 {locations: ["hserver1.local"], table: "donuts", regionId:"1234"},
>                 {locations: ["hserver2.local"], table: "donuts", regionId:"5678"}
>            ],
>            output:[ // matches the output of the last internal op so we might even refer to it directly
>                {mode: "VECTOR", type:"SQL_INT"},  // output field 1 is a value vector driven by expression 1.
>                {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
>            ]
>        },
>
> Final question: all of this will be "strongly typed", correct? I mean these will be properties of the Scan physical op and not arbitrary JSON that maps to an SE-specific *InputConfig?
>
> -david
>
> On Apr 21, 2013, at 9:56 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:
>
>> Here is roughly what I'm thinking.
>>
>> Storage engines have three levels of configuration that they receive.
>>
>> 1. System-level configuration (such as picking the particular hbase
>> cluster).  This is bound to StorageEngineConfig.
>> 2. Scan-node level configuration to be applied across read entries for
>> a particular scan entry.  I'm currently modeling this as ScanSettings
>> locally.
>> 3. Read-level settings for a particular portion of a scan.  (The
>> subdivision for parallelization.)  (ReadEntry)
>>
>> Initially, a scan node will be used to describe an ambiguous scan