Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Hive Join with distinct rows


Copy link to this message
-
Re: Hive Join with distinct rows
Mark Grover 2012-11-09, 04:24
Hi Praveen,
Let's take an example:
(from
https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses
)

-- Print out contents of the table
hive> SELECT col1, col2 FROM t1;
1 3
1 3
1 4
2 5

-- Selects distinct col1, col2 tuple
hive> SELECT DISTINCT col1, col2 FROM t1;
1 3
1 4
2 5
Similar to the second query above, your Query 2 selects each of the
distinct values for <id, url, timestamp> tuple possibly giving you multiple
records for a given id on the left side of the join. Consequently you don't
get the result you expect.

Mark

On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
>
> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only
> unique values are present.
>
> I want to join both tables T1 & T2 such that select all the unique ID's in
> T1 which are not in  T2
>
> I have written 2 queries:
>
> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND
> T2.ID=NULL
>
> this gives me expected results, but with the below I am getting all the
> rows in T1
>
> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER
> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
>
> Can some one point me how to achieve: select all the unique ID's in T1
> which are not in T2
>
> Thanks,
> Praveen
>
>