Parquet and ORC both support predicate pushdown. I was looking at
whether we can reuse the existing PartitionFilterOptimizer by reporting
the columns that support predicate pushdown as partition columns.
Dmitriy was talking about the PartitionFilterOptimizer
pushing down the filter conditions to the LoadFunc but not removing them
from the actual filter condition. But even the new FilterExtractor (and old
PColFilterExtractor) that Aniket wrote removes the filter condition pushed
down. In a way that makes sense for HCat: when you filter out a lot of
partitions, you don't want each record filtered again on the partition
condition, wasting CPU. But in the case of columnar file formats, the
pushed-down predicates are only used for selecting/skipping row
groups/stripes, not for answering the actual query. So we need a new
optimizer for pushing down predicates to file formats, one that does not
remove the filter condition, and a new Load interface.
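
To illustrate why the filter has to stay in the plan, here is a minimal
sketch in Python. It is not Pig, Parquet or ORC code: RowGroup and
scan_with_pushdown are hypothetical names, and the min/max statistics are
just an assumed stand-in for whatever index the file format keeps. The
pushed-down predicate only skips whole row groups, so rows that fail the
predicate can still come back from a group that was not skipped.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group or an ORC stripe."""
    min_val: int      # smallest value of the column in this group
    max_val: int      # largest value of the column in this group
    rows: List[int]   # the column values themselves


def scan_with_pushdown(groups: List[RowGroup], lower_bound: int) -> Iterator[int]:
    """Read rows for the predicate `col > lower_bound`.

    Groups whose max value cannot satisfy the predicate are skipped
    entirely. Groups that might match are returned in full, including
    rows that do not satisfy the predicate.
    """
    for group in groups:
        if group.max_val <= lower_bound:
            continue  # no row in this group can match, skip it
        yield from group.rows


groups = [
    RowGroup(min_val=1, max_val=5, rows=[1, 3, 5]),      # skipped for col > 6
    RowGroup(min_val=4, max_val=9, rows=[4, 7, 9]),      # read, but 4 fails
    RowGroup(min_val=10, max_val=12, rows=[10, 12]),     # read, all match
]

# What the loader returns after pushdown: still contains the value 4.
loaded = list(scan_with_pushdown(groups, lower_bound=6))

# The filter operator must remain in the plan to produce the right answer.
result = [value for value in loaded if value > 6]
```

If the optimizer removed the filter the way it does for partition
columns, the query would return `loaded` (which includes 4) instead of
`result`.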
There are no JIRAs filed for this yet. I will file one soon. Has anyone
already given thought to this, or have an API design in mind? We are
planning to work on this and the main focus is on ORCFile, but we want to
ensure that we address all cases for Parquet as well. Julien/Aniket, could
you help with any questions on the Parquet front?
ORCFile pushes down filter predicates using indexes/column sorting,
dictionary sorting or bloom filters according to
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
I don't think it can push down filters for complex data structures like
lists or maps. Daniel, can you confirm?
Can you tell me how predicate pushdown works with Parquet? Does it support
map columns? I could not find much documentation on it.