Hi everyone,

Currently I am working on the implementation of the Parquet page index for
Impala.
(design doc is here if you are interested:
https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
)

During our discussions it came up that DataPageHeaderV2 states that page
boundaries are also record boundaries:

https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532

DataPageHeader (V1) doesn't have this statement, which means that in theory
it allows records to span multiple pages. Is that really the case, or
is it something that is missing from the specification?

I ask this because filtering pages based on the page index is much
simpler if page boundaries are record boundaries as well.
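
To illustrate what I mean, here is a rough sketch (not Impala code; the
function names and the flat-list layout of the index are made up for this
example). It assumes every page starts on a record boundary, so each page
covers a whole range of rows, and that per-page min/max values and first row
indexes are available from the page index:

```python
def matching_row_ranges(first_rows, mins, maxs, num_rows, lo, hi):
    """Return [start, end) row ranges of pages whose [min, max] overlaps [lo, hi].

    first_rows[i] is the index of the first row in page i. This is only
    meaningful when every page starts on a record boundary (assumption).
    """
    ranges = []
    for i, start in enumerate(first_rows):
        end = first_rows[i + 1] if i + 1 < len(first_rows) else num_rows
        if maxs[i] >= lo and mins[i] <= hi:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous match: extend that range.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
    return ranges


def pages_to_read(first_rows, num_rows, row_ranges):
    """Return indexes of pages (of any column) that overlap one of the row ranges."""
    pages = []
    for i, start in enumerate(first_rows):
        end = first_rows[i + 1] if i + 1 < len(first_rows) else num_rows
        if any(start < r_end and r_start < end for r_start, r_end in row_ranges):
            pages.append(i)
    return pages


# Column A: three pages of 100 rows each, filtered with 60 <= a <= 85.
rows = matching_row_ranges([0, 100, 200], [1, 50, 90], [40, 80, 120], 300, 60, 85)
# Column B has different page boundaries, but they are still row-aligned.
b_pages = pages_to_read([0, 150, 250], 300, rows)
```

If a record could start in one page and continue in the next, the row ranges
above would not map cleanly onto pages, and the reader would also have to
fetch neighbouring pages to reassemble the records on the edges.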

Thanks,
    Zoltan