On May 26, 2011, at 1:28 PM, Guy Bayes wrote:
Crap sorry hit send too early
1: Job overhead of generating statistics on the fly with
Overhead is minimal. The only measurable overhead is inserting one row into an RDBMS/HBase at the end of each task. At the end of the query, an aggregation query runs against the RDBMS/HBase. Trunk (0.8-snapshot) has some more optimizations to further reduce the overhead. Note that every RDBMS/HBase operation is subject to a timeout, and you can configure the timeout value as appropriate.
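As a rough illustration of the publish-then-aggregate pattern described above, here is a minimal Python sketch. It is not Hive's actual code: SQLite stands in for the RDBMS/HBase store, and the table name `partial_stats`, the function names, and the 30-second default timeout are all assumptions made up for the example.

```python
import sqlite3


def make_store(timeout_seconds=30.0):
    """Create a stand-in stats store (hypothetical schema, in-memory SQLite)."""
    # `timeout` bounds how long SQLite waits on a locked database; it plays the
    # role of the configurable timeout on stats operations in this sketch.
    conn = sqlite3.connect(":memory:", timeout=timeout_seconds)
    conn.execute(
        "CREATE TABLE partial_stats (task_id TEXT PRIMARY KEY, row_count INTEGER)"
    )
    return conn


def publish_task_stats(conn, task_id, row_count):
    """Run once at the end of a task: insert a single row of partial stats."""
    conn.execute(
        "INSERT INTO partial_stats (task_id, row_count) VALUES (?, ?)",
        (task_id, row_count),
    )
    conn.commit()


def aggregate_stats(conn):
    """Run once at the end of the query: sum the per-task rows."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(row_count), 0) FROM partial_stats"
    ).fetchone()
    return total
```

The point of the sketch is that the per-task cost is one small insert, and the only other cost is a single aggregation at the end of the query.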
2: Are stat descriptions in describe table extended implemented? I've gathered stats on a table but do not see the expected entries (rowNum = , etc) in the describe statement.
It is working. If rowNum is not there, an error must have occurred during stats publishing or aggregation. Both steps are designed to tolerate any exception so that a stats failure won't affect the main query. You can look in the Hive log at /tmp/<username>/hive.log, or in the task log, and search for Stats warning messages.
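If the log is large, a small filter can help narrow it down. The following is only a sketch: it assumes the relevant lines contain both a warning level and the word "stats" somewhere on the same line, which may not match the exact message format your Hive version writes. The function name is hypothetical.

```python
def find_stats_warnings(lines):
    """Return the lines that mention both a warning level and stats.

    Assumption: a stats warning has "warn" and "stats" on the same line
    (case-insensitive). Adjust the check if your log layout differs.
    """
    matches = []
    for line in lines:
        lowered = line.lower()
        if "warn" in lowered and "stats" in lowered:
            matches.append(line.rstrip("\n"))
    return matches


# Example usage (replace <username> with your own user name):
#   with open("/tmp/<username>/hive.log") as log:
#       for hit in find_stats_warnings(log):
#           print(hit)
```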
3: How does hive actually use stats to influence query plans? Any documentation?
Currently no optimizations are done based on these stats, although using them in the query planner is one of our intentions.
we are on CDH3 GA by the way
On Thu, May 26, 2011 at 1:25 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
Hello all, I'm new to this list,
I was wondering if anyone could answer a couple questions about the implementation of statistics in 0.7?
and have the following q