Fwd: Apache Drill and Schema
Michael Hausenblas 2013-05-11, 18:43
After a recent Apache Drill talk in Berlin, one of the people in the audience, Gunnar (in CC), raised some very interesting schema-related challenges. He was kind enough to write up his questions; see below. I hope that, as a community, we do a good job of addressing them.
Again, thanks a lot for your interest in Drill, Gunnar, and I hope you'll find time to test-drive it once it's available in all its beauty!
Begin forwarded message:
> From: Gunnar Schröder <[EMAIL PROTECTED]>
> Subject: Apache Drill and Schema
> Date: 9 May 2013 22:15:59 GMT+01:00
> Hi Michael,
> as promised, I'm e-mailing you some questions regarding Apache Drill that crossed my mind during your talk. Since you might want to forward this e-mail, or parts of it, to the mailing list, I'm writing in English. :-)
> The question that puzzled me most after hearing your talk was how Apache Drill handles schema information about the data in any of its sources.
> RDBMSs, for example, have a very strict schema definition (schema on write). Hive and Impala use the metastore to define a very similar schema definition that must be available before querying data (schema on read). This schema is necessary for Hive to query HBase, which has a more flexible schema, e.g. dynamic columns and no explicit types for data.
> I have trouble understanding how you want to implement SQL (or any other high-level query language) as the query language if schema information is unavailable or only partially available. If you reuse Hive's schema information from its metastore or from a relational database such as MySQL, you are fine. But how do you handle a raw CSV file, or HBase without Hive schema information?
> Some simple examples:
> SELECT * FROM table ORDER BY col1;
> If the table is a CSV file or an HBase table, how do you determine whether the ordering should use string sort order or numeric order?
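To make the ambiguity concrete, here is a minimal sketch (plain Python, not Drill code; the type-inference heuristic is an illustrative assumption, not anything Drill specifies):

```python
# Without a schema, a CSV column's sort order is ambiguous:
# lexicographic and numeric orderings disagree on the same values.
values = ["10", "9", "100", "2"]

string_order = sorted(values)            # lexicographic (string) order
numeric_order = sorted(values, key=int)  # requires knowing the type

print(string_order)   # ['10', '100', '2', '9']
print(numeric_order)  # ['2', '9', '10', '100']

# One common workaround (hypothetical here, not necessarily what Drill
# does): infer the type by sampling the column, falling back to string
# order if any value fails to parse as an integer.
def infer_sort_key(sample):
    try:
        for v in sample:
            int(v)
        return int
    except ValueError:
        return str

print(sorted(values, key=infer_sort_key(values)))  # ['2', '9', '10', '100']
```

Sampling-based inference can of course guess wrong if the sample is unrepresentative, which is exactly the problem full schema information avoids.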
> SELECT * FROM table;
> I'm no expert on this, but I assume that if you query a table via JDBC or ODBC, you need full schema information before you can return the first result of the query.
> SELECT * FROM table1, table2 WHERE table1.col1 = table2.col2;
> Let's assume the column to join on is a string. Some storage engines may distinguish between null and the empty string. Others may not be in first normal form and may store multiple strings in a single column.
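A small illustration of the NULL-vs-empty-string point (not Drill code): under SQL's three-valued logic, NULL never compares equal to anything, including another NULL, so NULL keys never join, while an engine that conflates NULL with the empty string would match them.

```python
def sql_inner_join(keys1, keys2):
    """Inner join on scalar keys with SQL NULL semantics.

    None models SQL NULL: it compares unequal to everything,
    including another None, so None keys produce no join matches.
    The empty string, by contrast, joins with itself normally.
    """
    return [(a, b) for a in keys1 for b in keys2
            if a is not None and b is not None and a == b]

t1 = ["x", "", None]
t2 = ["", None, "x"]
print(sql_inner_join(t1, t2))  # [('x', 'x'), ('', '')]  -- no (None, None) pair
```

An engine that treats NULL and "" as the same value would additionally emit matches for the None rows, which is exactly the kind of semantic divergence that schema and type information normally pins down.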
> These are just some examples, but I doubt whether it is possible to implement SQL:2003 as the query language without full schema information.
> I scrolled through the Apache Drill design document; it mentions a metadata repository, but the information regarding schema and typing is very sketchy.
> My recommendation would be:
> Start out by reusing the Hive metadata repository and implement Apache Drill on top of this schema information. Then gradually replace or supplement it with your own (possibly more dynamic and flexible) metadata repository.
> To be successful:
> - Keep it under the Apache open-source licence
> - Your major competitor is Impala, so think about the respects in which you can be better
> - SQL and JDBC are very important for success (as interfaces to databases and reporting tools)
> - HBase would greatly benefit from a low-latency SQL querying facility (various attempts are ongoing)
> - Provide interfaces to plug in new storage or file-format handlers (this is a strength of Hive, but seems hard to do with Impala), e.g. to access files in your own particular format within HDFS
> Maybe these random thoughts about Apache Drill are useful to you. I would love to see a true open-source alternative to Impala: one that supplements Hive for real-time queries, is implemented in Java, and provides pluggable interfaces for extension.
> Gunnar Schröder