Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Hive producing difference outputs


Copy link to this message
-
Hive producing difference outputs
Chen Song 2012-11-15, 16:18
Hi Folks

We are getting inconsistent output when running some semantically same Hive
queries. We are using *CDH3u3* with *Hive 0.7.1.*

The query I am running performs a map-side join of two subqueries (s1 and
s2)--s1 is a multi-table join and s2 is a union of a table against itself.
If we add in mapside join to s1 (which is necessary for performance) the
query produces different (and incorrect) results.

Basically, s1 returns one row and s2 returns 2 rows. By joining them, it
should give us two rows. See below for output and queries (I abridged the
queries a bit for readability).

NOTE: I searched online and found anther thread reporting a similar issue
as what I have seen,
http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+[EMAIL PROTECTED]%3E.
Unfortunately, this thread never got answered. It looks like a bug in Hive.
Greatly appreciate if anyone give thoughts on this issue.
Q1 output (incorrect, missing one row):
2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL

Q2 output (correct):
2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0
2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
2012-08-1518:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1

Q1 and Q2 only differs in the way that s1 in Q1 has a mapside join.

   - Q1 (incorrect results, highlighted portion is the only difference from
   Q2):
SELECT /*+MAPJOIN(table) */ *
FROM
(
  SELECT /*+ MAPJOIN(tableA,tableB) */
   id,
   ...
  FROM main_table ctg
  JOIN sample smp
    ON ctg.id = smp.id
  LEFT OUTER JOIN tableA
    ON smp.publisher_id = tableA.id
  LEFT OUTER JOIN tableB
    ON smp.inventory_source_id = tableB.id
  WHERE ctg.date_time >= '2012-08-15 00:00:00' and
        ctg.date_time <= '2012-08-15 23:59:59' and
        ctg.dd = '2012-08-15' and
        smp.date_time >= '2012-08-15 00:00:00' and
        smp.date_time <= '2012-08-15 23:59:59' and
        smp.dd = '2012-08-15'
)sub
JOIN
(
  select
    id,
    parent_category_id,
    cast(0 as tinyint) as is_parent
  from table_category
  where is_system = 1
  union all
  select
    id,
    parent_category_id,
    cast(1 as tinyint) as is_parent
  from table_category
  where is_system = 1
)table
ON table.id = sub.id
   - Q2 (correct results):

SELECT /*+MAPJOIN(table) */ *
FROM
(
  SELECT
   id,
   ...
  FROM main_table ctg
  JOIN sample smp
    ON ctg.id = smp.id
  LEFT OUTER JOIN tableA
    ON smp.publisher_id = tableA.id
  LEFT OUTER JOIN tableB
    ON smp.inventory_source_id = tableB.id
  WHERE ctg.date_time >= '2012-08-15 00:00:00' and
        ctg.date_time <= '2012-08-15 23:59:59' and
        ctg.dd = '2012-08-15' and
        smp.date_time >= '2012-08-15 00:00:00' and
        smp.date_time <= '2012-08-15 23:59:59' and
        smp.dd = '2012-08-15'
)sub
JOIN
(
  select
    id,
    parent_category_id,
    cast(0 as tinyint) as is_parent
  from table_category
  where is_system = 1
  union all
  select
    id,
    parent_category_id,
    cast(1 as tinyint) as is_parent
  from table_category
  where is_system = 1
)table
ON table.id = sub.id

--
Chen Song