I've encountered an issue with hive's predicate push down optimization when
multi-group-by is used together with transform. Here is a simple testcase
to illustrate my point:
CREATE TABLE IF NOT EXISTS my_table (
) USING 'cat' AS (
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1'
SELECT id, property1, SUM(count)
GROUP BY id, property1
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test2'
SELECT id, property2, SUM(count)
WHERE property1 != 0
GROUP BY id, property2;
When hive.optimize.ppd = true, hive moves the where clause from second
select all the way down into the transform operator, which is obviously
wrong, because it affects the first select as well. With
hive.optimize.ppd=false everything works as expected. Without the
transform, it works correctly as well.
I see this problem with Hive version 0.10.0 (cdh4.4.0). With Hive 0.7.1 the
same query behaves correctly, regardless of hive.optimize.ppd settings. So
it seems as a bug introduced with some ppd improvements in 0.8 or later.
Can anyone confirm if this is still broken in newest versions? If it
doesn't work with 0.12, I'll file a new issue in JIRA.