Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Hive producing difference outputs


+
Chen Song 2012-11-15, 16:18
+
Mark Grover 2012-12-20, 10:03
Copy link to this message
-
Re: Hive producing difference outputs
Chen Song 2012-12-19, 16:55
resent as no one has replied yet :)

On Thu, Nov 15, 2012 at 11:18 AM, Chen Song <[EMAIL PROTECTED]> wrote:

> Hi Folks
>
> We are getting inconsistent output when running some semantically same
> Hive queries. We are using *CDH3u3* with *Hive 0.7.1.*
>
> The query I am running performs a map-side join of two subqueries (s1 and
> s2)--s1 is a multi-table join and s2 is a union of a table against itself.
> If we add in mapside join to s1 (which is necessary for performance) the
> query produces different (and incorrect) results.
>
> Basically, s1 returns one row and s2 returns 2 rows. By joining them, it
> should give us two rows. See below for output and queries (I abridged the
> queries a bit for readability).
>
> NOTE: I searched online and found anther thread reporting a similar issue
> as what I have seen,
> http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+[EMAIL PROTECTED]%3E.
> Unfortunately, this thread never got answered. It looks like a bug in Hive.
> Greatly appreciate if anyone give thoughts on this issue.
>
>
> Q1 output (incorrect, missing one row):
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL
>
> Q2 output (correct):
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1
>
> Q1 and Q2 only differs in the way that s1 in Q1 has a mapside join.
>
>    - Q1 (incorrect results, highlighted portion is the only difference
>    from Q2):
>
>
>  SELECT /*+MAPJOIN(table) */ *
> FROM
> (
>   SELECT /*+ MAPJOIN(tableA,tableB) */
>    id,
>    ...
>   FROM main_table ctg
>   JOIN sample smp
>     ON ctg.id = smp.id
>   LEFT OUTER JOIN tableA
>     ON smp.publisher_id = tableA.id
>   LEFT OUTER JOIN tableB
>     ON smp.inventory_source_id = tableB.id
>   WHERE ctg.date_time >= '2012-08-15 00:00:00' and
>         ctg.date_time <= '2012-08-15 23:59:59' and
>         ctg.dd = '2012-08-15' and
>         smp.date_time >= '2012-08-15 00:00:00' and
>         smp.date_time <= '2012-08-15 23:59:59' and
>         smp.dd = '2012-08-15'
> )sub
> JOIN
>  (
>   select
>     id,
>     parent_category_id,
>     cast(0 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
>   union all
>   select
>     id,
>     parent_category_id,
>     cast(1 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
> )table
>  ON table.id = sub.id
>
>
>    - Q2 (correct results):
>
>  SELECT /*+MAPJOIN(table) */ *
> FROM
> (
>   SELECT
>    id,
>    ...
>   FROM main_table ctg
>   JOIN sample smp
>     ON ctg.id = smp.id
>   LEFT OUTER JOIN tableA
>     ON smp.publisher_id = tableA.id
>   LEFT OUTER JOIN tableB
>     ON smp.inventory_source_id = tableB.id
>   WHERE ctg.date_time >= '2012-08-15 00:00:00' and
>         ctg.date_time <= '2012-08-15 23:59:59' and
>         ctg.dd = '2012-08-15' and
>         smp.date_time >= '2012-08-15 00:00:00' and
>         smp.date_time <= '2012-08-15 23:59:59' and
>         smp.dd = '2012-08-15'
> )sub
> JOIN
>  (
>   select
>     id,
>     parent_category_id,
>     cast(0 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
>   union all
>   select
>     id,
>     parent_category_id,
>     cast(1 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
> )table
>  ON table.id = sub.id
>
> --
> Chen Song
>
>
>
--
Chen Song