Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Hive Join with distinct rows


Copy link to this message
-
Re: Hive Join with distinct rows
Hi Praveen

Have you tried applying DISTINCT without the brackets around T1.ID

select distinct
T1.ID, T1.url, T1.timestamp from T1 LEFT OUTER JOIN T2 on
T1.IDT2.ID AND
T2.ID=NULL
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Praveen Kumar K J V S <[EMAIL PROTECTED]>
Date: Fri, 9 Nov 2012 21:30:28
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Hive Join with distinct rows

Thanks Mark, I do understand that how Hive works with Distinct keyword.

What I was looking for is a solution for my requirement in Hive, I am not
an expert in SQL, hence looking for suggestions
On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote:

> Hi Praveen,
> Let's take an example:
> (from
> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses
> )
>
> -- Print out contents of the table
> hive> SELECT col1, col2 FROM t1;
> 1 3
> 1 3
> 1 4
> 2 5
>
> -- Selects distinct col1, col2 tuple
> hive> SELECT DISTINCT col1, col2 FROM t1;
> 1 3
> 1 4
> 2 5
>
>
> Similar to the second query above, your Query 2 selects each of the
> distinct values for <id, url, timestamp> tuple possibly giving you multiple
> records for a given id on the left side of the join. Consequently you don't
> get the result you expect.
>
> Mark
>
> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S <
> [EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
>>
>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only
>> unique values are present.
>>
>> I want to join both tables T1 & T2 such that select all the unique ID's
>> in T1 which are not in  T2
>>
>> I have written 2 queries:
>>
>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND
>> T2.ID=NULL
>>
>> this gives me expected results, but with the below I am getting all the
>> rows in T1
>>
>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER
>> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
>>
>> Can some one point me how to achieve: select all the unique ID's in T1
>> which are not in T2
>>
>> Thanks,
>> Praveen
>>
>>
>

NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB