


HIVE left semi join multiple tables with columns from multiple table
Hi,
I am struggling with a problem described below. Any help how to resolve this problem is highly appreciated.
I have got few tables the structure is over simplified for the sake of describing the nature of the problem that I am facing.
*Table 1* * *
ID
Value
1
V11
2
V12
3
V13
*Table 2* * *
ID
Value
1
V21
2
V22
3
V23
*Table 3* * *
ID_1
ID_2
Other_Column1
1
3
C1
2
1
C2
3
2
C3
I want to formulate a table which would look like
ID_1
ID_1_Value
ID_2
ID_2_value
Other_Column1
1
V11
3
V23
C1
2
V12
1
V21
C2
3
V13
2
V22
C3
I am facing problem with this. I tried to LEFT SEMI JOIN in Hive.
I tried to do it in two steps (For 3 tables) First step I wanted to do a LEFT SEMI JOIN with TABLE 1 and TABLE 3. But the problem is that in LEFT SEMI JOIN you can not have columns from the right table in the select clause. So after join my new table simply looks like TABLE 3.
Can you help me how I can achieve this is HIVE.
Thanks and regards, Souvik.

Re: HIVE left semi join multiple tables with columns from multiple table
Souvik, Let me begin by saying that simplifying the problem goes a long way in helping us answer your question. You did it really nicely here, so thank you for doing that.
Why don't you use INNER JOIN instead of LEFT SEMI JOIN? You can express the same query as INNER JOIN with no restrictions on what can be included in the SELECT clause. LEFT SEMI JOIN just implements an "exists" like query efficiently. If you want columns from the right table in your select list, just do the regular (aka inefficient way) inner join.
Of course, you can optimize inner joins as map joins, sorted merge join or sorted merge bucketed joins depending on your use case.
Mark
On Thu, Dec 27, 2012 at 3:16 PM, Souvik Banerjee <[EMAIL PROTECTED]>wrote:
> Hi, > > I am struggling with a problem described below. > Any help how to resolve this problem is highly appreciated. > > I have got few tables the structure is over simplified for the sake of > describing the nature of the problem that I am facing. > > *Table 1* > * * > > > > ID > > Value > > 1 > > V11 > > 2 > > V12 > > 3 > > V13 > > > > *Table 2* > * * > > > > ID > > Value > > 1 > > V21 > > 2 > > V22 > > 3 > > V23 > > > > *Table 3* > * * > > > > ID_1 > > ID_2 > > Other_Column1 > > 1 > > 3 > > C1 > > 2 > > 1 > > C2 > > 3 > > 2 > > C3 > > > > I want to formulate a table which would look like > > ID_1 > > ID_1_Value > > ID_2 > > ID_2_value > > Other_Column1 > > 1 > > V11 > > 3 > > V23 > > C1 > > 2 > > V12 > > 1 > > V21 > > C2 > > 3 > > V13 > > 2 > > V22 > > C3 > > > > I am facing problem with this. > I tried to LEFT SEMI JOIN in Hive. > > I tried to do it in two steps (For 3 tables) > First step I wanted to do a LEFT SEMI JOIN with TABLE 1 and TABLE 3. > But the problem is that in LEFT SEMI JOIN you can not have columns from > the right table in the select clause. So after join my new table simply > looks like TABLE 3. > > Can you help me how I can achieve this is HIVE. > > Thanks and regards, > Souvik. >

Re: HIVE left semi join multiple tables with columns from multiple table
Thanks a lot Mark for your attention. But I think I cannot go for INNER join, the reason behind the fact being that I want all rows of Table3 irrespective of there is any row corresponding to that ID in Table 1 or Table 2. Probably I would have taken care of that while providing the example. The more refined example would be like below.
Look forward for your help.
Thanks and regards, Souvik.
P.S. Dropped one table from the earlier example and Now we have table 1 and table 2, we are looking for table 4. (Hope so it's more simple and it's exactly we need)
*Table 1*
ID
Value
1
V11
2
V12
3
V13
*Table 2*
ID_1
ID_2
Other_Column1
1
3
C1
2
1
C2
3
2
C3
4
9
C4
12
3
C5
1
8
C6
I want to formulate a table which would look like (*Table 4*)
ID_1
ID_1_Value
ID_2
ID_2_value
Other_Column1
1
V11
3
V13
C1
2
V12
1
V11
C2
3
V13
2
V12
C3
4
NULL / Empty
9
NULL / Empty
C4
12
NULL / Empty
3
V13
C5
1
V11
8
NULL / Empty
C6 On Thu, Dec 27, 2012 at 9:06 PM, Mark Grover <[EMAIL PROTECTED]>wrote:
> Souvik, > Let me begin by saying that simplifying the problem goes a long way in > helping us answer your question. You did it really nicely here, so thank > you for doing that. > > Why don't you use INNER JOIN instead of LEFT SEMI JOIN? You can express > the same query as INNER JOIN with no restrictions on what can be included > in the SELECT clause. LEFT SEMI JOIN just implements an "exists" like query > efficiently. If you want columns from the right table in your select list, > just do the regular (aka inefficient way) inner join. > > Of course, you can optimize inner joins as map joins, sorted merge join or > sorted merge bucketed joins depending on your use case. > > Mark > > On Thu, Dec 27, 2012 at 3:16 PM, Souvik Banerjee <[EMAIL PROTECTED] > > wrote: > >> Hi, >> >> I am struggling with a problem described below. >> Any help how to resolve this problem is highly appreciated. >> >> I have got few tables the structure is over simplified for the sake of >> describing the nature of the problem that I am facing. >> >> *Table 1* >> * * >> >> >> >> ID >> >> Value >> >> 1 >> >> V11 >> >> 2 >> >> V12 >> >> 3 >> >> V13 >> >> >> >> *Table 2* >> * * >> >> >> >> ID >> >> Value >> >> 1 >> >> V21 >> >> 2 >> >> V22 >> >> 3 >> >> V23 >> >> >> >> *Table 3* >> * * >> >> >> >> ID_1 >> >> ID_2 >> >> Other_Column1 >> >> 1 >> >> 3 >> >> C1 >> >> 2 >> >> 1 >> >> C2 >> >> 3 >> >> 2 >> >> C3 >> >> >> >> I want to formulate a table which would look like >> >> ID_1 >> >> ID_1_Value >> >> ID_2 >> >> ID_2_value >> >> Other_Column1 >> >> 1 >> >> V11 >> >> 3 >> >> V23 >> >> C1 >> >> 2 >> >> V12 >> >> 1 >> >> V21 >> >> C2 >> >> 3 >> >> V13 >> >> 2 >> >> V22 >> >> C3 >> >> >> >> I am facing problem with this. >> I tried to LEFT SEMI JOIN in Hive. >> >> I tried to do it in two steps (For 3 tables) >> First step I wanted to do a LEFT SEMI JOIN with TABLE 1 and TABLE 3. >> But the problem is that in LEFT SEMI JOIN you can not have columns from >> the right table in the select clause. So after join my new table simply >> looks like TABLE 3. >> >> Can you help me how I can achieve this is HIVE. >> >> Thanks and regards, >> Souvik. >> > >

Re: HIVE left semi join multiple tables with columns from multiple table
Souvik, In your new example, you need a RIGHT OUTER JOIN between table1 and table2 (order matters  table1 on left, table2 on right) on the ID1 column. Something like this (untested by me): SELECT table1.*, table2.* FROM table1 RIGHT OUTER JOIN table2 ON (table1.id=table2.id_1);
Mark
On Thu, Dec 27, 2012 at 9:26 PM, Souvik Banerjee <[EMAIL PROTECTED]>wrote:
> Thanks a lot Mark for your attention. > But I think I cannot go for INNER join, the reason behind the fact being > that I want all rows of Table3 irrespective of there is any row > corresponding to that ID in Table 1 or Table 2. > Probably I would have taken care of that while providing the example. > The more refined example would be like below. > > Look forward for your help. > > Thanks and regards, > Souvik. > > P.S. Dropped one table from the earlier example and Now we have table 1 > and table 2, we are looking for table 4. (Hope so it's more simple and it's > exactly we need) > > > > *Table 1* > > ID > > Value > > 1 > > V11 > > 2 > > V12 > > 3 > > V13 > > > > *Table 2* > > ID_1 > > ID_2 > > Other_Column1 > > 1 > > 3 > > C1 > > 2 > > 1 > > C2 > > 3 > > 2 > > C3 > > 4 > > 9 > > C4 > > 12 > > 3 > > C5 > > 1 > > 8 > > C6 > > > > I want to formulate a table which would look like (*Table 4*) > > ID_1 > > ID_1_Value > > ID_2 > > ID_2_value > > Other_Column1 > > 1 > > V11 > > 3 > > V13 > > C1 > > 2 > > V12 > > 1 > > V11 > > C2 > > 3 > > V13 > > 2 > > V12 > > C3 > > 4 > > NULL / Empty > > 9 > > NULL / Empty > > C4 > > 12 > > NULL / Empty > > 3 > > V13 > > C5 > > 1 > > V11 > > 8 > > NULL / Empty > > C6 > > > On Thu, Dec 27, 2012 at 9:06 PM, Mark Grover <[EMAIL PROTECTED]>wrote: > >> Souvik, >> Let me begin by saying that simplifying the problem goes a long way in >> helping us answer your question. You did it really nicely here, so thank >> you for doing that. >> >> Why don't you use INNER JOIN instead of LEFT SEMI JOIN? You can express >> the same query as INNER JOIN with no restrictions on what can be included >> in the SELECT clause. LEFT SEMI JOIN just implements an "exists" like query >> efficiently. If you want columns from the right table in your select list, >> just do the regular (aka inefficient way) inner join. >> >> Of course, you can optimize inner joins as map joins, sorted merge join >> or sorted merge bucketed joins depending on your use case. >> >> Mark >> >> On Thu, Dec 27, 2012 at 3:16 PM, Souvik Banerjee < >> [EMAIL PROTECTED]> wrote: >> >>> Hi, >>> >>> I am struggling with a problem described below. >>> Any help how to resolve this problem is highly appreciated. >>> >>> I have got few tables the structure is over simplified for the sake of >>> describing the nature of the problem that I am facing. >>> >>> *Table 1* >>> * * >>> >>> >>> >>> ID >>> >>> Value >>> >>> 1 >>> >>> V11 >>> >>> 2 >>> >>> V12 >>> >>> 3 >>> >>> V13 >>> >>> >>> >>> *Table 2* >>> * * >>> >>> >>> >>> ID >>> >>> Value >>> >>> 1 >>> >>> V21 >>> >>> 2 >>> >>> V22 >>> >>> 3 >>> >>> V23 >>> >>> >>> >>> *Table 3* >>> * * >>> >>> >>> >>> ID_1 >>> >>> ID_2 >>> >>> Other_Column1 >>> >>> 1 >>> >>> 3 >>> >>> C1 >>> >>> 2 >>> >>> 1 >>> >>> C2 >>> >>> 3 >>> >>> 2 >>> >>> C3 >>> >>> >>> >>> I want to formulate a table which would look like >>> >>> ID_1 >>> >>> ID_1_Value >>> >>> ID_2 >>> >>> ID_2_value >>> >>> Other_Column1 >>> >>> 1 >>> >>> V11 >>> >>> 3 >>> >>> V23 >>> >>> C1 >>> >>> 2 >>> >>> V12 >>> >>> 1 >>> >>> V21 >>> >>> C2 >>> >>> 3 >>> >>> V13 >>> >>> 2 >>> >>> V22 >>> >>> C3 >>> >>> >>> >>> I am facing problem with this. >>> I tried to LEFT SEMI JOIN in Hive. >>> >>> I tried to do it in two steps (For 3 tables) >>> First step I wanted to do a LEFT SEMI JOIN with TABLE 1 and TABLE 3. >>> But the problem is that in LEFT SEMI JOIN you can not have columns from >>> the right table in the select clause. So after join my new table simply >>> looks like TABLE 3. >>> >>> Can you help me how I can achieve this is HIVE. >>> >>> Thanks and regards, >>> Souvik.

Re: HIVE left semi join multiple tables with columns from multiple table
Thanks Mark for writing back to me. I did the same thing and still not getting the desired result. I was looking for the cause. I managed to get the desired result. Sharing my thought process and how I did it. Let me know your thoughts on this, that would be really nice to have. Also please share any better idea to achieve this result and feel free to point out if you think that my explanation is wrong. I think you noticed the fact that I want the value column for both the ID_1 and ID_2 and there values in these columns are not UNIQUE. It happened that I used to get more rows returned than the original no of rows in the second table due to the fact that there are duplicates in ID_1 and ID_2 column. So first of all to get the Value column for both ID_1 and ID_2 I have to join twice and that too I can not have join on condition which have duplicates. So I created a new table out of table 2 with addition of a column having unique value for each row using HIVE UDF. Then I created 2 views each for ID_1 and ID_2 which contains the description by joining on the condition a.ID_1 = b.ID. Now I have two views and then I joined these two views on the condition that view1.uniqueid = view2.uniqueid. That gives me desired output.
Looking forward to have your views.
Thanks and regards, Souvik.
On Fri, Dec 28, 2012 at 7:40 AM, Mark Grover <[EMAIL PROTECTED]>wrote:
> Souvik, > In your new example, you need a RIGHT OUTER JOIN between table1 and table2 > (order matters  table1 on left, table2 on right) on the ID1 column. > Something like this (untested by me): > SELECT > table1.*, > table2.* > FROM > table1 > RIGHT OUTER JOIN table2 > ON (table1.id=table2.id_1); > > Mark > > > On Thu, Dec 27, 2012 at 9:26 PM, Souvik Banerjee <[EMAIL PROTECTED] > > wrote: > >> Thanks a lot Mark for your attention. >> But I think I cannot go for INNER join, the reason behind the fact being >> that I want all rows of Table3 irrespective of there is any row >> corresponding to that ID in Table 1 or Table 2. >> Probably I would have taken care of that while providing the example. >> The more refined example would be like below. >> >> Look forward for your help. >> >> Thanks and regards, >> Souvik. >> >> P.S. Dropped one table from the earlier example and Now we have table 1 >> and table 2, we are looking for table 4. (Hope so it's more simple and it's >> exactly we need) >> >> >> >> *Table 1* >> >> ID >> >> Value >> >> 1 >> >> V11 >> >> 2 >> >> V12 >> >> 3 >> >> V13 >> >> >> >> *Table 2* >> >> ID_1 >> >> ID_2 >> >> Other_Column1 >> >> 1 >> >> 3 >> >> C1 >> >> 2 >> >> 1 >> >> C2 >> >> 3 >> >> 2 >> >> C3 >> >> 4 >> >> 9 >> >> C4 >> >> 12 >> >> 3 >> >> C5 >> >> 1 >> >> 8 >> >> C6 >> >> >> >> I want to formulate a table which would look like (*Table 4*) >> >> ID_1 >> >> ID_1_Value >> >> ID_2 >> >> ID_2_value >> >> Other_Column1 >> >> 1 >> >> V11 >> >> 3 >> >> V13 >> >> C1 >> >> 2 >> >> V12 >> >> 1 >> >> V11 >> >> C2 >> >> 3 >> >> V13 >> >> 2 >> >> V12 >> >> C3 >> >> 4 >> >> NULL / Empty >> >> 9 >> >> NULL / Empty >> >> C4 >> >> 12 >> >> NULL / Empty >> >> 3 >> >> V13 >> >> C5 >> >> 1 >> >> V11 >> >> 8 >> >> NULL / Empty >> >> C6 >> >> >> On Thu, Dec 27, 2012 at 9:06 PM, Mark Grover <[EMAIL PROTECTED] >> > wrote: >> >>> Souvik, >>> Let me begin by saying that simplifying the problem goes a long way in >>> helping us answer your question. You did it really nicely here, so thank >>> you for doing that. >>> >>> Why don't you use INNER JOIN instead of LEFT SEMI JOIN? You can express >>> the same query as INNER JOIN with no restrictions on what can be included >>> in the SELECT clause. LEFT SEMI JOIN just implements an "exists" like query >>> efficiently. If you want columns from the right table in your select list, >>> just do the regular (aka inefficient way) inner join. >>> >>> Of course, you can optimize inner joins as map joins, sorted merge join >>> or sorted merge bucketed joins depending on your use case. >>> >>> Mark >>>

Re: HIVE left semi join multiple tables with columns from multiple table
Souvik, I am not sure I understand, you may want to play around with distinct keyword if you want to join on unique values. In any case, it looks like you've got it figured out already, so it's all good:)
Happy trails! Mark
On Sat, Dec 29, 2012 at 2:19 PM, Souvik Banerjee <[EMAIL PROTECTED]>wrote:
> Thanks Mark for writing back to me. > I did the same thing and still not getting the desired result. > I was looking for the cause. I managed to get the desired result. Sharing > my thought process and how I did it. Let me know your thoughts on this, > that would be really nice to have. Also please share any better idea to > achieve this result and feel free to point out if you think that my > explanation is wrong. > I think you noticed the fact that I want the value column for both the > ID_1 and ID_2 and there values in these columns are not UNIQUE. > It happened that I used to get more rows returned than the original no of > rows in the second table due to the fact that there are duplicates in ID_1 > and ID_2 column. > So first of all to get the Value column for both ID_1 and ID_2 I have to > join twice and that too I can not have join on condition which have > duplicates. > So I created a new table out of table 2 with addition of a column having > unique value for each row using HIVE UDF. > Then I created 2 views each for ID_1 and ID_2 which contains the > description by joining on the condition a.ID_1 = b.ID. > Now I have two views and then I joined these two views on the condition > that view1.uniqueid = view2.uniqueid. > That gives me desired output. > > Looking forward to have your views. > > Thanks and regards, > Souvik. > > > On Fri, Dec 28, 2012 at 7:40 AM, Mark Grover <[EMAIL PROTECTED]>wrote: > >> Souvik, >> In your new example, you need a RIGHT OUTER JOIN between table1 and >> table2 (order matters  table1 on left, table2 on right) on the ID1 column. >> Something like this (untested by me): >> SELECT >> table1.*, >> table2.* >> FROM >> table1 >> RIGHT OUTER JOIN table2 >> ON (table1.id=table2.id_1); >> >> Mark >> >> >> On Thu, Dec 27, 2012 at 9:26 PM, Souvik Banerjee < >> [EMAIL PROTECTED]> wrote: >> >>> Thanks a lot Mark for your attention. >>> But I think I cannot go for INNER join, the reason behind the fact being >>> that I want all rows of Table3 irrespective of there is any row >>> corresponding to that ID in Table 1 or Table 2. >>> Probably I would have taken care of that while providing the example. >>> The more refined example would be like below. >>> >>> Look forward for your help. >>> >>> Thanks and regards, >>> Souvik. >>> >>> P.S. Dropped one table from the earlier example and Now we have table 1 >>> and table 2, we are looking for table 4. (Hope so it's more simple and it's >>> exactly we need) >>> >>> >>> >>> *Table 1* >>> >>> ID >>> >>> Value >>> >>> 1 >>> >>> V11 >>> >>> 2 >>> >>> V12 >>> >>> 3 >>> >>> V13 >>> >>> >>> >>> *Table 2* >>> >>> ID_1 >>> >>> ID_2 >>> >>> Other_Column1 >>> >>> 1 >>> >>> 3 >>> >>> C1 >>> >>> 2 >>> >>> 1 >>> >>> C2 >>> >>> 3 >>> >>> 2 >>> >>> C3 >>> >>> 4 >>> >>> 9 >>> >>> C4 >>> >>> 12 >>> >>> 3 >>> >>> C5 >>> >>> 1 >>> >>> 8 >>> >>> C6 >>> >>> >>> >>> I want to formulate a table which would look like (*Table 4*) >>> >>> ID_1 >>> >>> ID_1_Value >>> >>> ID_2 >>> >>> ID_2_value >>> >>> Other_Column1 >>> >>> 1 >>> >>> V11 >>> >>> 3 >>> >>> V13 >>> >>> C1 >>> >>> 2 >>> >>> V12 >>> >>> 1 >>> >>> V11 >>> >>> C2 >>> >>> 3 >>> >>> V13 >>> >>> 2 >>> >>> V12 >>> >>> C3 >>> >>> 4 >>> >>> NULL / Empty >>> >>> 9 >>> >>> NULL / Empty >>> >>> C4 >>> >>> 12 >>> >>> NULL / Empty >>> >>> 3 >>> >>> V13 >>> >>> C5 >>> >>> 1 >>> >>> V11 >>> >>> 8 >>> >>> NULL / Empty >>> >>> C6 >>> >>> >>> On Thu, Dec 27, 2012 at 9:06 PM, Mark Grover < >>> [EMAIL PROTECTED]> wrote: >>> >>>> Souvik, >>>> Let me begin by saying that simplifying the problem goes a long way in >>>> helping us answer your question. You did it really nicely here, so thank >>>> you for doing that.

