Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Multi-group-by with transform leads to incorrect optimization


Copy link to this message
-
Multi-group-by with transform leads to incorrect optimization
Hello,

I've encountered an issue with hive's predicate push down optimization when
multi-group-by is used together with transform. Here is a simple testcase
to illustrate my point:

CREATE TABLE IF NOT EXISTS my_table (
        id INT,
        property1 INT,
        property2 INT,
        count INT
);

EXPLAIN
FROM (
    SELECT TRANSFORM(
        id,
        property1,
        property2,
        count
    ) USING 'cat' AS (
        id INT,
        property1 INT,
        count INT,
        property2 INT
    )
    FROM my_table
) t
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1'
SELECT id, property1, SUM(count)
GROUP BY id, property1
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test2'
SELECT id, property2, SUM(count)
WHERE property1 != 0
GROUP BY id, property2;

When hive.optimize.ppd = true, hive moves the where clause from second
select all the way down into the transform operator, which is obviously
wrong, because it affects the first select as well. With
hive.optimize.ppd=false everything works as expected. Without the
transform, it works correctly as well.

I see this problem with Hive version 0.10.0 (cdh4.4.0). With Hive 0.7.1 the
same query behaves correctly, regardless of hive.optimize.ppd settings. So
it seems as a bug introduced with some ppd improvements in 0.8 or later.
Can anyone confirm if this is still broken in newest versions? If it
doesn't work with 0.12, I'll file a new issue in JIRA.

Best regards,
Jan Dolinar

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB