Hive, mail # dev - Why do I get statistics diff in EXPLAIN for Parquet?


RE: Why do I get statistics diff in EXPLAIN for Parquet?
Remus Rusanu 2014-02-17, 15:07
OK, I get similar diffs with ORC, so this is not a Parquet problem.
The expected .out files were created by running mvn test on Windows, so the issue is Windows-specific, not Parquet-specific. I'll investigate...

From: Remus Rusanu [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 17, 2014 3:59 PM
To: [EMAIL PROTECTED]
Cc: Brock Noland
Subject: Why do I get statistics diff in EXPLAIN for Parquet?

Looking at the failed Jenkins runs for HIVE-5998, I see diffs in the statistics reported by EXPLAIN:

Running: diff -a /root/hive/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/vectorized_parquet.q.out /root/hive/itests/qtest/../../ql/src/test/results/clientpositive/vectorized_parquet.q.out
72c72
<             Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE
75c75
<               Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
79c79
<                 Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
82c82
<                   Statistics: Num rows: 10 Data size: 60 Basic stats: COMPLETE Column stats: NONE
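One thing the numbers above have in common: in every line the data size is exactly 6 times the row count. The following is a minimal Python sketch that extracts the figures from the diff lines quoted above and computes that ratio. The helper names (parse_stats, bytes_per_row) are hypothetical and not part of Hive's test tooling, and the constant ratio by itself does not establish where the difference comes from.

```python
import re

# The "<" lines from the diff quoted above, with leading whitespace collapsed.
DIFF_LINES = [
    "< Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE",
    "< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE",
    "< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE",
    "< Statistics: Num rows: 10 Data size: 60 Basic stats: COMPLETE Column stats: NONE",
]

# Matches the row count and data size in an EXPLAIN "Statistics:" line.
STATS_RE = re.compile(r"Num rows: (\d+) Data size: (\d+)")


def parse_stats(line):
    """Return (num_rows, data_size) for a Statistics line, or None if absent."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def bytes_per_row(lines):
    """Return data_size / num_rows for each line that carries statistics."""
    ratios = []
    for line in lines:
        stats = parse_stats(line)
        if stats is not None and stats[0] > 0:
            rows, size = stats
            ratios.append(size / rows)
    return ratios


if __name__ == "__main__":
    print(bytes_per_row(DIFF_LINES))  # [6.0, 6.0, 6.0, 6.0]
```

Running the same helper over the other side of the diff would show whether only the sizes differ or the row counts as well.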

What would cause such statistics diffs? The Parquet table is created and populated as follows:

create table if not exists alltypes_parquet (
  cint int,
  ctinyint tinyint,
  csmallint smallint,
  cfloat float,
  cdouble double,
  cstring1 string) stored as parquet;

insert overwrite table alltypes_parquet
  select cint,
    ctinyint,
    csmallint,
    cfloat,
    cdouble,
    cstring1
  from alltypesorc;

Note that there are no diffs in the actual query results.

Thanks,
~Remus