Hive, mail # user - questions about statistics in 0.7


Re: questions about statistics in 0.7
Ning Zhang 2011-05-26, 22:21

On May 26, 2011, at 1:28 PM, Guy Bayes wrote:

Crap sorry hit send too early

questions
1: Job overhead of generating statistics on the fly with

set hive.stats.autogather=true;?

The overhead is minimal. The only measurable cost is inserting a row into an RDBMS/HBase at the end of each task. At the end of the query, an aggregation query runs against the RDBMS/HBase. Trunk (0.8-snapshot) has some further optimizations that reduce the overhead. Note that every RDBMS/HBase operation can time out, and you can configure the timeout value as appropriate.
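A minimal sketch of the session settings involved. Only hive.stats.autogather comes from this thread; hive.stats.dbclass and hive.stats.jdbc.timeout are property names assumed from the Hive configuration of this era, and the values are illustrative, so check them against the hive-default.xml that ships with your version:

```sql
-- Gather statistics on the fly while the query runs (from the question above).
set hive.stats.autogather=true;

-- Assumption: selects the store that tasks publish intermediate stats to
-- (a JDBC database such as Derby, or HBase).
set hive.stats.dbclass=jdbc:derby;

-- Assumption: timeout, in seconds, for the stats JDBC connection and statements.
set hive.stats.jdbc.timeout=30;
```

With a short timeout, a slow or unreachable stats store costs the query at most a brief wait, at the price of possibly missing stats for that run.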

2: Are stat descriptions in describe table extended implemented? I've gathered stats on a table but do not see the expected entries (rowNum = , etc.) in the describe output.

It is working. If rowNum is not there, an error must have occurred during stats publishing or aggregation. Both steps are designed to tolerate exceptions so that a stats failure won't affect the main query. You can look at the Hive log at /tmp/<username>/hive.log, or at the task log, and search for Stats warning messages.
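A minimal sketch of one way to check whether stats were published, assuming a hypothetical non-partitioned table named page_views (a partitioned table would also need a PARTITION clause on the ANALYZE statement):

```sql
-- page_views is a hypothetical table name used only for illustration.
-- Gather statistics for data that is already in the table.
ANALYZE TABLE page_views COMPUTE STATISTICS;

-- If publishing and aggregation succeeded, the table parameters in this
-- output should include the row count entry alongside the other stats.
DESCRIBE EXTENDED page_views;
```

If the row count entry is still missing after this, the log locations mentioned above are the place to look for the underlying exception.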

3: How does hive actually use stats to influence query plans? Any documentation?

Currently no optimizations are based on these stats, although using them for query optimization is one of our goals.

we are on CDH3 GA by the way

thanks
Guy

On Thu, May 26, 2011 at 1:25 PM, Guy Bayes <[EMAIL PROTECTED]> wrote:
Hello all, I'm new to this list,

I was wondering if anyone could answer a couple questions about the implementation of statistics in 0.7?

I've reviewed
http://wiki.apache.org/hadoop/Hive/StatsDev

and have the following q