|
|
-
Hive Join with distinct rows
Praveen Kumar K J V S 2012-11-09, 02:15
Hi,
I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only unique values are present.
I want to join both tables T1 & T2 such that select all the unique ID's in T1 which are not in T2
I have written 2 queries:
Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
this gives me expected results, but with the below I am getting all the rows in T1
Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
Can some one point me how to achieve: select all the unique ID's in T1 which are not in T2
Thanks, Praveen
+
Praveen Kumar K J V S 2012-11-09, 02:15
-
Re: Hive Join with distinct rows
Mark Grover 2012-11-09, 04:24
Hi Praveen, Let's take an example: (from https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses) -- Print out contents of the table hive> SELECT col1, col2 FROM t1; 1 3 1 3 1 4 2 5 -- Selects distinct col1, col2 tuple hive> SELECT DISTINCT col1, col2 FROM t1; 1 3 1 4 2 5 Similar to the second query above, your Query 2 selects each of the distinct values for <id, url, timestamp> tuple possibly giving you multiple records for a given id on the left side of the join. Consequently you don't get the result you expect. Mark On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < [EMAIL PROTECTED]> wrote: > Hi, > > I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} > > In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only > unique values are present. > > I want to join both tables T1 & T2 such that select all the unique ID's in > T1 which are not in T2 > > I have written 2 queries: > > Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND > T2.ID=NULL > > this gives me expected results, but with the below I am getting all the > rows in T1 > > Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER > JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL > > Can some one point me how to achieve: select all the unique ID's in T1 > which are not in T2 > > Thanks, > Praveen > >
+
Mark Grover 2012-11-09, 04:24
-
Re: Hive Join with distinct rows
Praveen Kumar K J V S 2012-11-09, 16:00
Thanks Mark, I do understand that how Hive works with Distinct keyword. What I was looking for is a solution for my requirement in Hive, I am not an expert in SQL, hence looking for suggestions On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote: > Hi Praveen, > Let's take an example: > (from > https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses> ) > > -- Print out contents of the table > hive> SELECT col1, col2 FROM t1; > 1 3 > 1 3 > 1 4 > 2 5 > > -- Selects distinct col1, col2 tuple > hive> SELECT DISTINCT col1, col2 FROM t1; > 1 3 > 1 4 > 2 5 > > > Similar to the second query above, your Query 2 selects each of the > distinct values for <id, url, timestamp> tuple possibly giving you multiple > records for a given id on the left side of the join. Consequently you don't > get the result you expect. > > Mark > > On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < > [EMAIL PROTECTED]> wrote: > >> Hi, >> >> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >> >> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only >> unique values are present. >> >> I want to join both tables T1 & T2 such that select all the unique ID's >> in T1 which are not in T2 >> >> I have written 2 queries: >> >> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND >> T2.ID=NULL >> >> this gives me expected results, but with the below I am getting all the >> rows in T1 >> >> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER >> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >> >> Can some one point me how to achieve: select all the unique ID's in T1 >> which are not in T2 >> >> Thanks, >> Praveen >> >> >
+
Praveen Kumar K J V S 2012-11-09, 16:00
-
Re: Hive Join with distinct rows
Bejoy KS 2012-11-09, 16:26
Hi Praveen Have you tried applying DISTINCT without the brackets around T1.ID select distinct T1.ID, T1.url, T1.timestamp from T1 LEFT OUTER JOIN T2 on T1.IDT2.ID AND T2.ID=NULL Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Praveen Kumar K J V S <[EMAIL PROTECTED]> Date: Fri, 9 Nov 2012 21:30:28 To: <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: Re: Hive Join with distinct rows Thanks Mark, I do understand that how Hive works with Distinct keyword. What I was looking for is a solution for my requirement in Hive, I am not an expert in SQL, hence looking for suggestions On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote: > Hi Praveen, > Let's take an example: > (from > https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses> ) > > -- Print out contents of the table > hive> SELECT col1, col2 FROM t1; > 1 3 > 1 3 > 1 4 > 2 5 > > -- Selects distinct col1, col2 tuple > hive> SELECT DISTINCT col1, col2 FROM t1; > 1 3 > 1 4 > 2 5 > > > Similar to the second query above, your Query 2 selects each of the > distinct values for <id, url, timestamp> tuple possibly giving you multiple > records for a given id on the left side of the join. Consequently you don't > get the result you expect. > > Mark > > On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < > [EMAIL PROTECTED]> wrote: > >> Hi, >> >> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >> >> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence only >> unique values are present. >> >> I want to join both tables T1 & T2 such that select all the unique ID's >> in T1 which are not in T2 >> >> I have written 2 queries: >> >> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND >> T2.ID=NULL >> >> this gives me expected results, but with the below I am getting all the >> rows in T1 >> >> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER >> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >> >> Can some one point me how to achieve: select all the unique ID's in T1 >> which are not in T2 >> >> Thanks, >> Praveen >> >> >
+
Bejoy KS 2012-11-09, 16:26
-
Re: Hive Join with distinct rows
Mark Grover 2012-11-09, 16:34
I see. I re-read your first email and you would like to query "select all the unique ID's in T1 which are not in T2" Query 1 seems to be doing just fine so I would say that's the way to go. I personally use "IS" operator when comparing something with NULLs instead of "=". There are some optimizations you can read about like semi joins that might come in handy for this query or queries in the future. https://cwiki.apache.org/Hive/languagemanual-joins.htmlMark On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S < [EMAIL PROTECTED]> wrote: > Thanks Mark, I do understand that how Hive works with Distinct keyword. > > What I was looking for is a solution for my requirement in Hive, I am not > an expert in SQL, hence looking for suggestions > > > On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote: > >> Hi Praveen, >> Let's take an example: >> (from >> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses>> ) >> >> -- Print out contents of the table >> hive> SELECT col1, col2 FROM t1; >> 1 3 >> 1 3 >> 1 4 >> 2 5 >> >> -- Selects distinct col1, col2 tuple >> hive> SELECT DISTINCT col1, col2 FROM t1; >> 1 3 >> 1 4 >> 2 5 >> >> >> Similar to the second query above, your Query 2 selects each of the >> distinct values for <id, url, timestamp> tuple possibly giving you multiple >> records for a given id on the left side of the join. Consequently you don't >> get the result you expect. >> >> Mark >> >> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < >> [EMAIL PROTECTED]> wrote: >> >>> Hi, >>> >>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >>> >>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence >>> only unique values are present. >>> >>> I want to join both tables T1 & T2 such that select all the unique ID's >>> in T1 which are not in T2 >>> >>> I have written 2 queries: >>> >>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND >>> T2.ID=NULL >>> >>> this gives me expected results, but with the below I am getting all the >>> rows in T1 >>> >>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER >>> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >>> >>> Can some one point me how to achieve: select all the unique ID's in T1 >>> which are not in T2 >>> >>> Thanks, >>> Praveen >>> >>> >> >
+
Mark Grover 2012-11-09, 16:34
-
Re: Hive Join with distinct rows
Praveen Kumar K J V S 2012-11-09, 17:03
Thank you very much Mark Yes query1 is doing just fine, but using query1 I will not be able to get the data in other columns in table T1 On Fri, Nov 9, 2012 at 10:04 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > I see. I re-read your first email and you would like to query "select all > the unique ID's in T1 which are not in T2" > > Query 1 seems to be doing just fine so I would say that's the way to go. I > personally use "IS" operator when comparing something with NULLs instead of > "=". > > There are some optimizations you can read about like semi joins that might > come in handy for this query or queries in the future. > > https://cwiki.apache.org/Hive/languagemanual-joins.html> > Mark > > > > On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S < > [EMAIL PROTECTED]> wrote: > >> Thanks Mark, I do understand that how Hive works with Distinct keyword. >> >> What I was looking for is a solution for my requirement in Hive, I am not >> an expert in SQL, hence looking for suggestions >> >> >> On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED]>wrote: >> >>> Hi Praveen, >>> Let's take an example: >>> (from >>> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses>>> ) >>> >>> -- Print out contents of the table >>> hive> SELECT col1, col2 FROM t1; >>> 1 3 >>> 1 3 >>> 1 4 >>> 2 5 >>> >>> -- Selects distinct col1, col2 tuple >>> hive> SELECT DISTINCT col1, col2 FROM t1; >>> 1 3 >>> 1 4 >>> 2 5 >>> >>> >>> Similar to the second query above, your Query 2 selects each of the >>> distinct values for <id, url, timestamp> tuple possibly giving you multiple >>> records for a given id on the left side of the join. Consequently you don't >>> get the result you expect. >>> >>> Mark >>> >>> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < >>> [EMAIL PROTECTED]> wrote: >>> >>>> Hi, >>>> >>>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >>>> >>>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence >>>> only unique values are present. >>>> >>>> I want to join both tables T1 & T2 such that select all the unique ID's >>>> in T1 which are not in T2 >>>> >>>> I have written 2 queries: >>>> >>>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND >>>> T2.ID=NULL >>>> >>>> this gives me expected results, but with the below I am getting all the >>>> rows in T1 >>>> >>>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT >>>> OUTER JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >>>> >>>> Can some one point me how to achieve: select all the unique ID's in T1 >>>> which are not in T2 >>>> >>>> Thanks, >>>> Praveen >>>> >>>> >>> >> >
+
Praveen Kumar K J V S 2012-11-09, 17:03
-
Re: Hive Join with distinct rows
Praveen Kumar K J V S 2012-11-09, 17:08
But I think Hive should support distinct on single column along with fetching corresponding data from other columns mentioned in the query. Something like "Select distinct(col1), col2, col3 from TB1" For example hive> SELECT col1, col2 FROM t1; 1 3 1 3 1 4 2 5 -- Selects distinct col1, col2 tuple hive> SELECT DISTINCT(col1), col2 FROM t1; 1 (3 or 4) 2 5 Does anyone think on the same line and suggest Hive supporting this feature. Thanks, Praveen On Fri, Nov 9, 2012 at 10:33 PM, Praveen Kumar K J V S < [EMAIL PROTECTED]> wrote: > Thank you very much Mark > > Yes query1 is doing just fine, but using query1 I will not be able to get > the data in other columns in table T1 > > > > On Fri, Nov 9, 2012 at 10:04 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > >> I see. I re-read your first email and you would like to query "select all >> the unique ID's in T1 which are not in T2" >> >> Query 1 seems to be doing just fine so I would say that's the way to go. >> I personally use "IS" operator when comparing something with NULLs instead >> of "=". >> >> There are some optimizations you can read about like semi joins that >> might come in handy for this query or queries in the future. >> >> https://cwiki.apache.org/Hive/languagemanual-joins.html>> >> Mark >> >> >> >> On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S < >> [EMAIL PROTECTED]> wrote: >> >>> Thanks Mark, I do understand that how Hive works with Distinct keyword. >>> >>> What I was looking for is a solution for my requirement in Hive, I am >>> not an expert in SQL, hence looking for suggestions >>> >>> >>> On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <[EMAIL PROTECTED] >>> > wrote: >>> >>>> Hi Praveen, >>>> Let's take an example: >>>> (from >>>> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses>>>> ) >>>> >>>> -- Print out contents of the table >>>> hive> SELECT col1, col2 FROM t1; >>>> 1 3 >>>> 1 3 >>>> 1 4 >>>> 2 5 >>>> >>>> -- Selects distinct col1, col2 tuple >>>> hive> SELECT DISTINCT col1, col2 FROM t1; >>>> 1 3 >>>> 1 4 >>>> 2 5 >>>> >>>> >>>> Similar to the second query above, your Query 2 selects each of the >>>> distinct values for <id, url, timestamp> tuple possibly giving you multiple >>>> records for a given id on the left side of the join. Consequently you don't >>>> get the result you expect. >>>> >>>> Mark >>>> >>>> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < >>>> [EMAIL PROTECTED]> wrote: >>>> >>>>> Hi, >>>>> >>>>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >>>>> >>>>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence >>>>> only unique values are present. >>>>> >>>>> I want to join both tables T1 & T2 such that select all the unique >>>>> ID's in T1 which are not in T2 >>>>> >>>>> I have written 2 queries: >>>>> >>>>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID>>>>> T2.ID AND T2.ID=NULL >>>>> >>>>> this gives me expected results, but with the below I am getting all >>>>> the rows in T1 >>>>> >>>>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT >>>>> OUTER JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >>>>> >>>>> Can some one point me how to achieve: select all the unique ID's in T1 >>>>> which are not in T2 >>>>> >>>>> Thanks, >>>>> Praveen >>>>> >>>>> >>>> >>> >> >
+
Praveen Kumar K J V S 2012-11-09, 17:08
|
|