Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Hive Join with distinct rows


Copy link to this message
-
Re: Hive Join with distinct rows
Bejoy KS 2012-11-09, 16:26
Hi Praveen

Have you tried applying DISTINCT without the brackets around T1.ID

select distinct
T1.ID, T1.url, T1.timestamp from T1 LEFT OUTER JOIN T2 on
T1.IDT2.ID AND
T2.ID=NULL
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Praveen Kumar K J V S <[EMAIL PROTECTED]>
Date: Fri, 9 Nov 2012 21:30:28
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Hive Join with distinct rows

Thanks Mark, I do understand that how Hive works with Distinct keyword.

What I was looking for is a solution for my requirement in Hive, I am not
an expert in SQL, hence looking for suggestions
On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote:

> Hi Praveen,
> Let's take an example:
> (from
> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses
> )
>
> -- Print out contents of the table
> hive> SELECT col1, col2 FROM t1;
> 1 3
> 1 3
> 1 4
> 2 5
>
> -- Selects distinct col1, col2 tuple
> hive> SELECT DISTINCT col1, col2 FROM t1;
> 1 3
> 1 4
> 2 5
>
>
> Similar to the second query above, your Query 2 selects each of the
> distinct values for <id, url, timestamp> tuple possibly giving you multiple
> records for a given id on the left side of the join. Consequently you don't
> get the result you expect.
>
> Mark
>
> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S <
> [EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
>>
>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only
>> unique values are present.
>>
>> I want to join both tables T1 & T2 such that select all the unique ID's
>> in T1 which are not in  T2
>>
>> I have written 2 queries:
>>
>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND
>> T2.ID=NULL
>>
>> this gives me expected results, but with the below I am getting all the
>> rows in T1
>>
>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER
>> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
>>
>> Can some one point me how to achieve: select all the unique ID's in T1
>> which are not in T2
>>
>> Thanks,
>> Praveen
>>
>>
>