-Next steps, "hello world" queries, and rants about query languages
Julian Hyde 2012-09-19, 20:33
On Sep 18, 2012, at 4:09 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I think that the key next steps are:
> 1) codify some API's (Jason and Julian are on this)
> 2) build out the "hello world" app using the available parser and Optiq and
> a trivial data model (#1 is not a blocker for this)
> 3) evaluate where we are implications are relative to nested data and
> columnar formats
It would be really useful to establish a list of queries. Minimal, but demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the ability to aggregate over nested collections.
> Once we have that, we should be able to open up the parallelism of
> development massively.
> Some open questions/tasks that many people may be able to determine include:
> - is it possible to change Optiq operators and optimization rules without a
Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq starts from scratch each time a query is prepared, and takes whatever metadata sources, operators and rules it is given.)
> - what has to happen to make Optiq handle nested data?
I am still mulling the answer to that question. The list of queries I mentioned above will help me understand what exactly is "nested data" and the operations on it.
I have been using http://developers.google.com/bigquery/docs/query-reference as my main resource for BigQuery/DrQL. Let me know if there is a more authoritative source.
For the record (I'll only make this whine once, I promise) the BigQuery language has some usability flaws, particularly inconsistencies with SQL. Some of these may have been deliberate (for instance, to try to make the language more concise for the user to type), or may have been made out of ignorance for SQL.
* The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator counts not-NULL values, not boolean TRUE values, so in SQL this expression would not have the desired effect
* My jaw dropped when I read "Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins". I see how commas are a convenient abbreviation for UNION, but that's like redefining "if" in a programming language.
* "Note that the HAVING clause can only refer to fields defined in your SELECT clause (if the field has an alias, you must use it; if it doesn't, use the aggregate field name instead)." This kind of design decision makes the language difficult to generate for.
* Apparently string literals can be enclosed in either single or double quotes. This may be for the convenience of the user, but is at odds with SQL, which allows only single quotes. (Many SQL dialects use double quotes to quote identifiers that contain mixed case or spaces.)
* BigQuery allows table names to be prefixed with an optional "projectname:".
* The syntax for accessing nested collections at depth 2 or more. If "a" is a table alias, and b is a collection-valued field, then "a.b" is the collection of sub-records, and "a.b.c" is the union of the collections of the "c" field of all records in the "a.b" collection. In 'a.b.c', the first '.' is the conventional operator that accesses a field of a record. But the second '.' is a strange beast that operates on a set of records.
Most of these I offer as evidence that DrQL is not a superset of SQL. I suppose we can vive la difference, implement separate parsers/validators for the two languages, including implementing DrQL "features" we think are ill-advised.
But I was looking to DrQL to find out how to query nested collections, and I didn't find much depth. How, for instance, to convert a "dept" relation with nested "emp" records into a "flat" relation? Or given an "shipment.order.lineitem" nested relation, sum up lineitem.discount only in orders that have "order.prepaid=true"? Are nested collections ordered?
SQL's support for nested collections is powerful & consistent. (But not very concise, or intuitive for the non-expert.) It has operators like UNNEST that convert a nested collection into a relation (note that collections and relations are different beasts), and operators such as CARDINALITY to aggregate them. (My friend John Sichi wrote a great overview here: http://farrago.sourceforge.net/design/CollectionTypes.html.)
To be clear, I am not advocating adding SQL's collection constructs to Drill's query language. But I am drawing inspiration from them when designing how Optiq would represent collections, because a query planner needs semantics to be very, very precise. I can see how most of DrQL's constructs would map onto SQL's constructs.
I'll have a better answer to the "what has to happen to make Optiq handle nested data" question in a day or two.