
Why do I get statistics diff in EXPLAIN for Parquet?
Looking at the failed Jenkins runs for HIVE-5998, I see diffs in the statistics in the EXPLAIN output:
Running: diff -a /root/hive/itests/qtest/../../itests/qtest/target/qfileresults/clientpositive/vectorized_parquet.q.out /root/hive/itests/qtest/../../ql/src/test/results/clientpositive/vectorized_parquet.q.out
72c72
< Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE
75c75
< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
79c79
< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
82c82
< Statistics: Num rows: 10 Data size: 60 Basic stats: COMPLETE Column stats: NONE
What would cause such statistics diffs? The Parquet table is created as:
create table if not exists alltypes_parquet (
  cint int,
  ctinyint tinyint,
  csmallint smallint,
  cfloat float,
  cdouble double,
  cstring1 string) stored as parquet;

insert overwrite table alltypes_parquet
  select cint,
         ctinyint,
         csmallint,
         cfloat,
         cdouble,
         cstring1
  from alltypesorc;
Note that there are no diffs in the actual query results.
Thanks,
~Remus

RE: Why do I get statistics diff in EXPLAIN for Parquet?
OK, I get similar diffs with ORC, so it is not Parquet.
The expected .out files were created by running mvn test on Windows, so the issue is Windows-specific, not Parquet-specific. I'll investigate...
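One way to narrow this down is to pull the Statistics lines out of both .q.out files and compare the row counts and the bytes per row. Below is a minimal sketch in Python, assuming the line format shown in the diff; the helper name parse_stats and the inline sample are illustrative only and are not part of the Hive test harness.

```python
import re

# Matches Statistics lines in the format shown in the EXPLAIN diff.
STATS_RE = re.compile(r"Statistics: Num rows: (\d+) Data size: (\d+)")


def parse_stats(text):
    """Return a list of (num_rows, data_size) tuples found in text."""
    return [(int(rows), int(size)) for rows, size in STATS_RE.findall(text)]


# The four lines reported by the Jenkins diff, copied from the message.
sample = """\
< Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE
< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
< Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
< Statistics: Num rows: 10 Data size: 60 Basic stats: COMPLETE Column stats: NONE
"""

stats = parse_stats(sample)
for rows, size in stats:
    # Bytes per row is a quick way to see how the estimate was derived.
    print(rows, size, size / rows)
```

For the four lines quoted in the diff, the ratio is 6 bytes per row every time. Running the same check on the Windows-generated expected file should show whether the two platforms disagree on the data size or on the per-row estimate.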
From: Remus Rusanu [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 17, 2014 3:59 PM
To: [EMAIL PROTECTED]
Cc: Brock Noland
Subject: Why do I get statistics diff in EXPLAIN for Parquet?