Hive, mail # dev - Re: Review Request: HIVE-1362: Support for column statistics in Hive


Shreepadma Venugopalan 2012-10-30, 01:24
Re: Review Request: HIVE-1362: Support for column statistics in Hive
Shreepadma Venugopalan 2012-10-30, 18:39

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/
-----------------------------------------------------------

(Updated Oct. 30, 2012, 6:39 p.m.)
Review request for hive and Carl Steinbach.
Changes
-------

Fixes the lint problems from the previous revision.
Description
-------

This patch implements version 1 of the column statistics project in Hive. It adds support for computing and persisting statistical summaries of column values in Hive tables and partitions. To support column statistics in Hive, this patch does the following:

* Adds a new compute stats UDAF to compute scalar statistics for all primitive Hive data types. Version 1 of the project supports the following scalar statistics on primitive types: an estimate of the number of distinct values, the number of null values, the number of trues/falses for boolean typed columns, the max and avg length for string and binary typed columns, and the max and min value for long and double typed columns. Note that version 1 of the column stats project supports column statistics at both the table and partition level.

* Adds Metastore schema tables to persist the newly added statistics both at table and partition level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics at both table and partition level.
Please refer to the following wiki link for the details of the schema and the Thrift API changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

* Extends the analyze table compute statistics statement to trigger statistics computation and persistence for one or more columns. Please note that statistics for multiple columns are computed through a single scan of the table data. Please refer to the following wiki link for the syntax changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
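
To illustrate the single-scan computation of scalar statistics described above, here is a minimal Java sketch. It is not the patch's GenericUDAFComputeStats; the class and method names (ColumnStatsSketch, LongStats, StringStats, scan) are hypothetical, and the row layout (column 0 is a long, column 1 is a string) is an assumption made for the example. The merge methods show how partial results from separate tasks could be combined.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: all names here are hypothetical, not the patch's classes.
final class ColumnStatsSketch {

    /** Null count, min and max for a long typed column. */
    static final class LongStats {
        long numNulls = 0;
        long count = 0;                 // non-null values seen
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        void update(Long v) {
            if (v == null) { numNulls++; return; }
            long x = v;
            count++;
            min = Math.min(min, x);
            max = Math.max(max, x);
        }

        // Combine a partial result computed elsewhere (e.g. by another task).
        void merge(LongStats o) {
            numNulls += o.numNulls;
            count += o.count;
            min = Math.min(min, o.min);
            max = Math.max(max, o.max);
        }
    }

    /** Null count, max length and avg length for a string typed column. */
    static final class StringStats {
        long numNulls = 0;
        long count = 0;                 // non-null values seen
        long maxLen = 0;
        long totalLen = 0;

        void update(String v) {
            if (v == null) { numNulls++; return; }
            count++;
            maxLen = Math.max(maxLen, v.length());
            totalLen += v.length();
        }

        void merge(StringStats o) {
            numNulls += o.numNulls;
            count += o.count;
            maxLen = Math.max(maxLen, o.maxLen);
            totalLen += o.totalLen;
        }

        double avgLen() {
            return count == 0 ? 0.0 : (double) totalLen / count;
        }
    }

    /** One pass over the rows updates the accumulators of both columns. */
    static void scan(List<Object[]> rows, LongStats longCol, StringStats strCol) {
        for (Object[] row : rows) {
            longCol.update((Long) row[0]);     // assumed: column 0 is a long
            strCol.update((String) row[1]);    // assumed: column 1 is a string
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {3L, "ab"},
            new Object[] {null, "abcd"},
            new Object[] {-7L, null});
        LongStats ls = new LongStats();
        StringStats ss = new StringStats();
        scan(rows, ls, ss);
        System.out.println("long: nulls=" + ls.numNulls + " min=" + ls.min + " max=" + ls.max);
        System.out.println("string: nulls=" + ss.numNulls + " maxLen=" + ss.maxLen + " avgLen=" + ss.avgLen());
    }
}
```

The distinct-value estimate is left out of this sketch because it needs a probabilistic data structure rather than a simple running value.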

One thing missing from the patch at this point is the metastore upgrade scripts for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the metastore schema changes before I go ahead and add the upgrade scripts.

In a follow-on patch, as part of version 2 of the column statistics project, we will add support for computing, persisting and retrieving histograms on long and double typed column values.
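
As a rough idea of what a histogram over a double typed column might look like, here is a minimal Java sketch. The description above does not say which kind of histogram version 2 will use; the equi-width bucketing, the class name EquiWidthHistogramSketch and the build method are all assumptions made for illustration. The min and max bounds are passed in and could, for instance, come from the version 1 statistics.

```java
// Illustrative only: equi-width bucketing is an assumption, not the
// design of version 2 of the column statistics project.
final class EquiWidthHistogramSketch {

    /**
     * Counts how many values fall into each of numBuckets equally wide
     * buckets covering [min, max]. Values outside the range (and NaN)
     * are ignored; a value equal to max lands in the last bucket.
     */
    static long[] build(double[] values, double min, double max, int numBuckets) {
        if (numBuckets <= 0 || !(max > min)) {
            throw new IllegalArgumentException("need numBuckets > 0 and max > min");
        }
        long[] counts = new long[numBuckets];
        double width = (max - min) / numBuckets;
        for (double v : values) {
            if (!(v >= min && v <= max)) {
                continue;                       // out of range or NaN
            }
            int idx = (int) ((v - min) / width);
            counts[Math.min(idx, numBuckets - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] values = {0.0, 1.0, 2.5, 5.0, 9.9, 10.0};
        long[] counts = build(values, 0.0, 10.0, 4);
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```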

Generated Thrift files have been removed from this diff to make it easier to read. The JIRA page has the full patch, including the generated Thrift files.
This addresses bug HIVE-1362.
    https://issues.apache.org/jira/browse/HIVE-1362
Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 211f474
  conf/hive-default.xml.template 93a86ec
  data/files/UserVisits.dat PRE-CREATION
  data/files/binary.txt PRE-CREATION
  data/files/bool.txt PRE-CREATION
  data/files/double.txt PRE-CREATION
  data/files/employee.dat PRE-CREATION
  data/files/employee2.dat PRE-CREATION
  data/files/int.txt PRE-CREATION
  metastore/if/hive_metastore.thrift d4fad72
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 915a5cf
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 17b986c
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 3883b5b
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java a49aecd
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a
  metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java PRE-CREATION
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java PRE-CREATION
  metastore/src/model/package.jdo 38ce6d5
  metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java 528a100
  metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 925938d
  ql/build.xml 80b7f79
  ql/if/queryplan.thrift 05fbf58
  ql/ivy.xml 2c4410a
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d
  ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952
  ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java de9fc04
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java 0b55ac4
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java c9e356a
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 65f748c
  ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6
  ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java 1c48815
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 349ab29
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java e77d59e
  ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java 11db6b7
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java PRE-CREATION
  ql/src/test/queries/clientnegative/columnstats_partlvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION
  ql/src/test/queries/