Re: Apache Drill and Schema
Julian Hyde 2013-05-13, 20:03
(For what it's worth, I believe that this philosophy is consistent with Jacques' vision for Drill.)
Optiq implements the SQL specification, and as such is strongly typed. Splunk and Drill are weakly typed, but in different ways.
In Splunk, there are a few core system fields, but every record can have a different set of fields. In a sense, the whole Splunk system is a single table with hundreds or thousands of fields, the set of distinct fields across all records. And of course that set of fields may change each time a record is added or removed.
The most convenient way to model a Splunk table was like this:
CREATE TABLE splunk (
source VARCHAR NOT NULL,
sourcetype VARCHAR NOT NULL,
"_extra" ANY);
There are two built-in fields, "source" and "sourcetype", present in every record, and the "_extra ANY" declaration tells Optiq that you can ask for any other field. Those extra fields have type VARCHAR (in Splunk all fields are strings), and their value in SQL will be NULL if they don't exist in a particular row. Thus you can write
SELECT s."source", s."sourcetype", s."action", CAST(s."discount" AS INTEGER)
FROM "splunk"."splunk" AS s
WHERE s."action" = 'purchase'
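The NULL-for-missing-fields behavior can be sketched outside SQL. Here is a minimal Python illustration (the records and field names are made up for the example, and this is not the actual Splunk adapter):

```python
# Each Splunk record is a dict; only "source" and "sourcetype" are guaranteed.
records = [
    {"source": "web", "sourcetype": "access", "action": "purchase", "discount": "5"},
    {"source": "web", "sourcetype": "access", "action": "view"},  # no "discount"
]

def project(record, field):
    # Extra fields are strings if present, NULL (None) otherwise.
    return record.get(field)

# Roughly what the query above asks for: project four fields,
# filter on action = 'purchase'.
rows = [
    (r["source"], r["sourcetype"], project(r, "action"), project(r, "discount"))
    for r in records
    if project(r, "action") == "purchase"
]
print(rows)
```

Only the first record survives the filter; asking the second record for "discount" would have yielded NULL rather than an error.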
To implement the Optiq SQL front-end to Drill, I added support for the JSON object model to SQL by adding MAP and ARRAY type constructors and a variant type ANY. Thus Drill's "donuts" example table appears in SQL as
CREATE TABLE donuts (
_map MAP(VARCHAR, ANY) NOT NULL);
This is strongly typed, and therefore consistent with the SQL standard. Every row has a single column called "_map", and its type is MAP(VARCHAR, ANY). (Equivalent to a Java field "final Map<String, Object> map;".) If you want to access the fields of the "_map" column you use the "[ ... ]" operator and casting.
To access the "ppu" field, you could write "SELECT CAST(_map['ppu'] AS INTEGER) as ppu FROM donuts". I would like to add syntactic sugar so that you can write "SELECT ppu FROM donuts" (implicitly inserting the "_map" field, just as Java inserts "this." before field references, and converting "x.field" to "x['field']" for any field of type MAP).
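The implicit-"_map" part of that sugaring amounts to a simple rewrite of column references. A toy version in Python (the real thing would operate on the SQL parse tree; the names here are hypothetical):

```python
# Columns the table actually declares; any other identifier is assumed
# to live inside the "_map" column.
DECLARED = {"_map"}

def sugar(identifier):
    """Rewrite a bare column reference into a _map access,
    much as Java inserts an implicit 'this.' before field references."""
    if identifier in DECLARED:
        return identifier
    return "_map['%s']" % identifier

print(sugar("ppu"))   # _map['ppu']
print(sugar("_map"))  # _map
```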
If later you know the schema, you can create a view:
CREATE VIEW donuts_typed AS
SELECT CAST(_map['name'] AS VARCHAR) AS name,
CAST(_map['ppu'] AS FLOAT) AS ppu
FROM donuts;
You can build your applications on top of such views with little or no loss of efficiency.
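The view-over-strings idea works in any SQL engine with CAST. A quick stand-in using SQLite (not Optiq; SQLite spells FLOAT as REAL, and the table here flattens "_map" into plain string columns for simplicity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw storage: everything is a string, as in the schemaless source.
conn.execute("CREATE TABLE donuts_raw (name TEXT, ppu TEXT)")
conn.execute("INSERT INTO donuts_raw VALUES ('Cake', '0.55')")
# Typed view: later, once the schema is known, cast to real types.
conn.execute("""
    CREATE VIEW donuts_typed AS
    SELECT name, CAST(ppu AS REAL) AS ppu FROM donuts_raw
""")
row = conn.execute("SELECT name, ppu FROM donuts_typed").fetchone()
print(row)  # ('Cake', 0.55)
```

Applications query donuts_typed and see a strongly typed schema, while the underlying storage stays schemaless.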
These examples illustrate ways to add strongly typed, standard SQL on top of a system with late or no schema. It is still standard SQL, which everyone agrees is very important. The sugarings don't damage the core language; if you don't like them, don't use them.
The biggest shortfall between the systems I've described above and what you seem to want is enumerating what columns are available. (Your examples all contain 'select *'.) In strongly typed SQL, if you want to know what columns are available, you ask the catalog (i.e. you use metadata). In a weakly typed system, you issue a query (i.e. you use data). If you issue "select *" against Optiq-for-Splunk or Optiq-for-Drill you will be disappointed. The column list will expand to "source, sourcetype" and "_map" respectively, because that is the only metadata known to the system before the query is run.
The easiest solution is to ask your users to specify explicit columns rather than writing "select *". (Machine-generated queries, such as my own Mondrian engine, already tend to do this.) Or, as I suggested earlier, go ahead and provide strongly typed views for your users to use. Or, one could build a UI to assist writing queries: the UI could suggest which columns to ask for, based on sketches of the data, previous query results, and columns that are already used in the query.