Re: question about schema
Here is roughly what I'm thinking.

Storage engines receive three levels of configuration (a rough sketch
follows the list):

1. System-level configuration (such as picking the particular HBase
cluster).  This is bound to StorageEngineConfig.
2. Scan-node-level configuration, applied across all read entries for
a particular scan node.  I'm currently modeling this as ScanSettings
locally.
3. Read-level settings for a particular portion of a scan (the
subdivision used for parallelization).  This is the ReadEntry.
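
Purely as an illustration of how those three levels might decompose
for the HBase engine, here is a sketch in Java.  Only the names
StorageEngineConfig, ScanSettings, and ReadEntry come from the
discussion above; every class and field name is hypothetical:

    import java.util.List;

    // Hypothetical sketch only: the classes and fields below are
    // illustrative, not a real Drill API.

    // 1. System level: identifies the cluster; shared by every scan
    //    that uses this storage engine instance.
    class HBaseStorageEngineConfig /* bound to StorageEngineConfig */ {
      String zookeeperQuorum;  // which HBase cluster to talk to
    }

    // 2. Scan-node level: settings applied across all read entries of
    //    one scan, e.g. a projection pushed down by the optimizer.
    class HBaseScanSettings /* the ScanSettings */ {
      List<ProjectedField> fields;
    }

    class ProjectedField {
      String family;      // e.g. "fam1"
      String qualifier;   // e.g. "qual1"
      String convert;     // e.g. "INT" or "UTF8"
      String outputMode;  // e.g. "VECTOR"
    }

    // 3. Read level: one parallelizable slice of the scan.
    class HBaseReadEntry /* the ReadEntry */ {
      List<String> locations;  // preferred hosts, e.g. ["hserver1.local"]
      String table;            // e.g. "donuts"
      String regionId;         // e.g. "1234"
    }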

Initially, a scan node will be used to describe an ambiguous scan
where the output is unknown.  It is the project node's responsibility
to convert the scan output into the desired schema.  However, if a
particular SE supports some level of projection, that information
would be pushed down by the optimizer to the scan node, and the
projection node would be removed (or modified to a simpler
projection).  The storage engine should be able to receive the
portions of the projection that it is responsible for via a TBD
interface, which then writes the information to the ScanSettings or
the ReadEntry, depending on where the storage engine wants it.
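
Since that interface is still TBD, the following is a pure straw man
of what the hand-off could look like; every name in it is invented
for illustration:

    import java.util.List;

    // Straw-man pushdown interface (TBD in reality; all names invented).
    // The optimizer offers the projection expressions it would like the
    // storage engine to absorb; the engine returns whatever it accepted,
    // so the optimizer knows what must remain in the project node.
    interface ProjectionPushdown {
      List<ProjectionExpr> pushProjection(List<ProjectionExpr> requested);
    }

    class ProjectionExpr {
      String expr;        // e.g. "CAST('fam1.qual1', int)"
      String outputMode;  // e.g. "VECTOR"
    }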

Below I've given a rough example of what a pre-optimization and
post-optimization physical plan might look like.  You can see how the
output of the scan changes and how a storage-engine-specific settings
object is added.
****PRE-OPTIMIZATION****
        {
            @id:1,
            pop:"scan",
            storageengine:"hbase",
            entries:[
             {locations: ["hserver1.local"], table: "donuts", regionId:"1234"},
             {locations: ["hserver2.local"], table: "donuts", regionId:"5678"}
            ],
            output: [
                {mode: "VECTOR", type: "MAP"} //field 1
            ]
        },
        {
            @id:2,
            child: 1,
            pop:"project",
            select: [
                {expr: "CAST('fam1.qual1', int)", mode: "VECTOR"},
                {expr: "CAST('fam1.qual2', nvarchar)", mode: "VECTOR"}
            ],
            output:[
                {mode: "VECTOR", type:"SQL_INT"},  // field 1
                {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
            ]
        },

****POST-OPTIMIZATION****

        {
            @id:1,
            pop:"scan",
            storageengine:"hbase",
            settings: {
              fields: [
                {family: "fam1", qualifier: "qual1", convert: "INT", output-mode: "VECTOR"},
                {family: "fam1", qualifier: "qual2", convert: "UTF8", output-mode: "VECTOR"}
              ]
            },
            entries:[
             {locations: ["hserver1.local"], table: "donuts", regionId:"1234"},
             {locations: ["hserver2.local"], table: "donuts", regionId:"5678"}
            ],
            output:[
                {mode: "VECTOR", type:"SQL_INT"},  // output field 1 is a value vector driven by expression 1
                {mode: "VECTOR", type:"SQL_NVARCHAR"}   // field 2
            ]
        },
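
Tying this back to David's point below: given a settings block like
the one above, the HBase storage engine could translate the
pushed-down fields into a server-side projection using the standard
HBase client API.  The helper below and the HBaseScanSettings and
ProjectedField types are from my hypothetical sketch earlier;
Scan.addColumn and Bytes.toBytes are the real HBase client calls:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    class ProjectionTranslator {
      // Hypothetical helper: restrict the HBase Scan to exactly the
      // pushed-down column family/qualifier pairs, so unprojected
      // columns never leave the region server.
      static Scan toHBaseScan(HBaseScanSettings settings) {
        Scan scan = new Scan();
        for (ProjectedField f : settings.fields) {
          scan.addColumn(Bytes.toBytes(f.family), Bytes.toBytes(f.qualifier));
        }
        return scan;
      }
    }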

On Sat, Apr 20, 2013 at 10:45 PM, David Alves <[EMAIL PROTECTED]> wrote:
>
> had a "duh" moment, realizing that, of course, I don't need a ProjectFilter as I can set the relevant cq's and cf's on HBase's Scan.
> the question of how to get the names of the columns the query is asking for (or even "*" if that is the case) still stands, though…
>
> -david
>
> On Apr 20, 2013, at 10:39 PM, David Alves <[EMAIL PROTECTED]> wrote:
>
> > Hi Jacques
> >
> >       I'm implementing a ProjectFilter for HBase and I got to the point where I need to pass to HBase the fields that are required (even if it's simply "all" as in *).
> >       How do I know which fields to scan in the SE, and their expected types?
> >       There's a bunch of schema stuff in org/apache/drill/exec/schema but I can't figure out how the SE uses it.
> >       Will this info come inside the scan logical op in getReadEntries(Scan scan) (in the arbitrary "selection" section)?
> >       Is this method still going to receive a logical Scan op, or is this just legacy stuff that you haven't had the chance to get to yet?