Hello,

Today, on a table for which we have created statistics (through the REFRESH
TABLE METADATA <path to table> command), Drill validates the timestamp of
every file and directory involved in the scan.

If any of these timestamps is later than that of the metadata file, a
regeneration of the metadata file is triggered.
If the metadata file has the latest timestamp, planning continues without
regenerating the metadata.
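
To make the cost concrete, here is a minimal Python sketch of the kind of
check described above. It is not Drill's actual code: the function name
metadata_is_stale and the way the metadata file is passed in are
assumptions made for illustration only.

```python
import os

def metadata_is_stale(table_dir, metadata_path):
    """Return True if any file or directory under table_dir has a
    modification time later than that of the metadata file.

    Hypothetical helper: one stat call per file and per directory,
    which is why the cost grows with the size of the table.
    """
    metadata_path = os.path.abspath(metadata_path)
    metadata_mtime = os.path.getmtime(metadata_path)
    for root, _dirs, files in os.walk(table_dir):
        # A directory's mtime changes when entries are added or removed.
        if os.path.getmtime(root) > metadata_mtime:
            return True
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            if path == metadata_path:
                continue  # do not compare the metadata file with itself
            if os.path.getmtime(path) > metadata_mtime:
                return True
    return False
```

Every query pays for this full walk before planning can continue, even
when nothing has changed.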

When the number of files to be queried increases, this operation can take a
significant amount of time.
We have seen cases where this validation step alone takes 3 to 5 seconds
(just checking the timestamps), meaning that planning took far longer than
executing the query.
This can be problematic in use cases where response time matters more than
the "accuracy" of the data.

What would you think about adding an option to the metadata generation, so
that the metadata is trusted for a configurable time period?
Example: REFRESH TABLE METADATA <path to table> WITH TTL='15m'
The exact syntax, of course, needs to be thought through.

This TTL would be stored in the metadata file and used at each query to
decide whether the timestamp validation (and a possible refresh) is needed.
This would significantly decrease the planning time when the metadata file
represents a large number of files.
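
A rough Python sketch of how the TTL could short-circuit the validation is
shown here. Everything in it is hypothetical: the '15m' style format, the
field names generated_at and ttl, and the function names are assumptions,
not an existing Drill format or API.

```python
import re
import time

# Hypothetical unit suffixes for a TTL such as '15m'.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(text):
    """Convert a TTL string such as '15m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError("unsupported TTL: %r" % text)
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]

def needs_timestamp_check(metadata, now=None):
    """Decide whether the per-file timestamp validation must run.

    metadata is a dict standing in for the metadata file, with
    'generated_at' (epoch seconds) and an optional 'ttl' string.
    """
    ttl = metadata.get("ttl")
    if ttl is None:
        return True  # no TTL stored: keep today's behavior
    now = time.time() if now is None else now
    return now - metadata["generated_at"] >= parse_ttl(ttl)
```

With a TTL of '15m', the planner would skip the file system walk entirely
for 900 seconds after the metadata was generated, then fall back to the
normal validation.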

Of course, this means that there could be cases where the metadata is
stale, so cases like the one below would need to be solved (since they may
happen much more frequently):
https://issues.apache.org/jira/browse/DRILL-6194
But my feeling is that since we already have a kind of race condition
between the view of the file system at planning time and the state found
during execution, we could gracefully accept that some files may have
disappeared between planning and execution.
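
As an illustration of what "gracefully accept" could mean at execution
time, here is a small Python sketch. The function read_existing and its
read_file callback are hypothetical and only show the idea of skipping
files that vanished after planning instead of failing the whole query.

```python
def read_existing(paths, read_file):
    """Read every planned file, skipping those that no longer exist.

    Returns (results, missing) so the caller can log or report the
    files that disappeared between planning and execution.
    """
    results, missing = [], []
    for path in paths:
        try:
            results.append(read_file(path))
        except FileNotFoundError:
            missing.append(path)  # file vanished after planning
    return results, missing
```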

If the TTL needs to be changed or removed completely, this could be done
by re-issuing REFRESH TABLE METADATA, either with a new TTL or without a
TTL at all.

What do you think?

Regards, Joel