|
|
-
Hive producing difference outputsChen Song 2012-11-15, 16:18
Hi Folks
We are getting inconsistent output when running some semantically same Hive queries. We are using *CDH3u3* with *Hive 0.7.1.* The query I am running performs a map-side join of two subqueries (s1 and s2)--s1 is a multi-table join and s2 is a union of a table against itself. If we add in mapside join to s1 (which is necessary for performance) the query produces different (and incorrect) results. Basically, s1 returns one row and s2 returns 2 rows. By joining them, it should give us two rows. See below for output and queries (I abridged the queries a bit for readability). NOTE: I searched online and found anther thread reporting a similar issue as what I have seen, http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+[EMAIL PROTECTED]%3E. Unfortunately, this thread never got answered. It looks like a bug in Hive. Greatly appreciate if anyone give thoughts on this issue. Q1 output (incorrect, missing one row): 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL Q2 output (correct): 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1 Q1 and Q2 only differs in the way that s1 in Q1 has a mapside join. - Q1 (incorrect results, highlighted portion is the only difference from Q2): SELECT /*+MAPJOIN(table) */ * FROM ( SELECT /*+ MAPJOIN(tableA,tableB) */ id, ... FROM main_table ctg JOIN sample smp ON ctg.id = smp.id LEFT OUTER JOIN tableA ON smp.publisher_id = tableA.id LEFT OUTER JOIN tableB ON smp.inventory_source_id = tableB.id WHERE ctg.date_time >= '2012-08-15 00:00:00' and ctg.date_time <= '2012-08-15 23:59:59' and ctg.dd = '2012-08-15' and smp.date_time >= '2012-08-15 00:00:00' and smp.date_time <= '2012-08-15 23:59:59' and smp.dd = '2012-08-15' )sub JOIN ( select id, parent_category_id, cast(0 as tinyint) as is_parent from table_category where is_system = 1 union all select id, parent_category_id, cast(1 as tinyint) as is_parent from table_category where is_system = 1 )table ON table.id = sub.id - Q2 (correct results): SELECT /*+MAPJOIN(table) */ * FROM ( SELECT id, ... FROM main_table ctg JOIN sample smp ON ctg.id = smp.id LEFT OUTER JOIN tableA ON smp.publisher_id = tableA.id LEFT OUTER JOIN tableB ON smp.inventory_source_id = tableB.id WHERE ctg.date_time >= '2012-08-15 00:00:00' and ctg.date_time <= '2012-08-15 23:59:59' and ctg.dd = '2012-08-15' and smp.date_time >= '2012-08-15 00:00:00' and smp.date_time <= '2012-08-15 23:59:59' and smp.dd = '2012-08-15' )sub JOIN ( select id, parent_category_id, cast(0 as tinyint) as is_parent from table_category where is_system = 1 union all select id, parent_category_id, cast(1 as tinyint) as is_parent from table_category where is_system = 1 )table ON table.id = sub.id -- Chen Song |