Pig >> mail # dev >> Predicate pushdown in columnar file formats


Predicate pushdown in columnar file formats
Hi,
   Both Parquet and ORC support predicate pushdown. I was looking at
whether we can reuse the existing PartitionFilterOptimizer by reporting the
columns that support predicate pushdown as partition columns. Dmitriy was
talking about the PartitionFilterOptimizer pushing the filter conditions
down to the LoadFunc without removing them from the actual filter
condition. But even the new FilterExtractor (and the old
PColFilterExtractor) that Aniket wrote removes the filter conditions that
are pushed down. In a way that makes sense for HCat: when you filter out a
lot of partitions, you don't want each record filtered again on the
partition condition, wasting CPU. But in the case of columnar file formats,
the pushed-down predicates are only used for selecting/skipping row
groups/stripes, not for answering the actual query. So we need a new
optimizer that pushes predicates down to file formats without removing the
filter condition, and a new Load interface.
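
To illustrate why the filter condition has to stay in the plan, here is a
minimal sketch in Python. It is not Pig, ORC or Parquet code: RowGroup,
load and query are hypothetical names, and the min/max statistics stand in
for whatever index a real file format keeps. The statistics can only prove
that a row group contains no matching rows; a group that survives the check
still holds rows that fail the predicate.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class RowGroup:
    """Hypothetical row group: values of one column plus min/max statistics."""
    min_value: int
    max_value: int
    rows: List[int]


def may_contain_greater_than(group: RowGroup, threshold: int) -> bool:
    # Statistics can rule a group out, but cannot prove that every row matches.
    return group.max_value > threshold


def load(groups: Iterable[RowGroup], threshold: int) -> Iterator[int]:
    """Loader side: skip row groups that cannot satisfy `value > threshold`."""
    for group in groups:
        if may_contain_greater_than(group, threshold):
            yield from group.rows


def query(groups: Iterable[RowGroup], threshold: int) -> List[int]:
    """Plan side: the original filter is still applied to every loaded row."""
    return [value for value in load(groups, threshold) if value > threshold]


GROUPS = [
    RowGroup(1, 5, [1, 3, 5]),
    RowGroup(4, 9, [4, 7, 9]),
    RowGroup(10, 12, [10, 11, 12]),
]
```

With a threshold of 6, the first group is skipped entirely, but the second
group is loaded in full, including the value 4. Only the retained filter
removes it, which is the difference from partition pruning, where the
pushed-down condition fully decides which records are read.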

 There are no JIRAs filed for this yet; I will file one soon. Has anyone
already given thought to this or have an API design in mind? We are
planning to work on this, and the main focus is on ORCFile, but we want to
ensure that we address all cases for Parquet as well. Julien/Aniket, could
you help with any questions on the Parquet front?

ORCFile pushes down filter predicates using indexes/column sorting,
dictionary sorting or bloom filters according to
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC. I
don't think it can push down filters for complex data structures like lists
or maps. Daniel, can you confirm?

Julien,
   Can you tell us how predicate pushdown works with Parquet? Does it
support map columns? I could not find much documentation on it.

Regards,
Rohini

 