|
|
mahsa mofidpoor 2012-03-13, 16:17
Hello,
Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented?
Thank you in advance for your response, Mahsa
+
mahsa mofidpoor 2012-03-13, 16:17
Keith Wiley 2012-03-13, 16:49
https://cwiki.apache.org/Hive/languagemanual-joins.html"Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job." I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework. That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify. Cheers! On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > Hello, > > Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented? > > Thank you in advance for your response, > Mahsa ________________________________________________________________________________ Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com "It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow." -- Keith Wiley ________________________________________________________________________________
+
Keith Wiley 2012-03-13, 16:49
mahsa mofidpoor 2012-03-13, 17:02
Hi Keith, Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references? Thanks and Regards, Mahsa On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]> wrote: > https://cwiki.apache.org/Hive/languagemanual-joins.html> > "Hive does not support join conditions that are not equality conditions as > it is very difficult to express such conditions as a map/reduce job." > > I admit, that isn't a very detailed answer, but it gives some indication > of the reason for the discrepancy between Hive and other databases. Hive > fundamentally operates on Hadoop, namely on MapReduce (we all know this, > I'm just reiterating the train of thought). The problem is that certain > algorithms are exceedingly difficult to wedge into the MapReduce framework. > > That is as detailed as my personal insight can get. I've done a lot of > MapReduce programming in Hadoop but I'm not a database expert and I don't > really understand the steps involved in various kinds of table-joins, so I > don't understand the particular ways in which certain database operations > do or do not fit into MapReduce...but presumably nonequality joins > (whatever those are :-D ) are particularly difficult to MapReduceify. > > Cheers! > > On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > > > Hello, > > > > Is there a reason behind not implementing non-equality joins in Hive? In > other words, is there any usage for theta-join, if implemented? > > > > Thank you in advance for your response, > > Mahsa > > > > ________________________________________________________________________________ > Keith Wiley [EMAIL PROTECTED] keithwiley.com > music.keithwiley.com > > "It's a fine line between meticulous and obsessive-compulsive and a > slippery > rope between obsessive-compulsive and debilitatingly slow." > -- Keith Wiley > > ________________________________________________________________________________ > >
+
mahsa mofidpoor 2012-03-13, 17:02
Tucker, Matt 2012-03-13, 17:13
For theta joins, you'll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it's passed to the reducers. Matt Tucker From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] Sent: Tuesday, March 13, 2012 1:02 PM To: [EMAIL PROTECTED] Subject: Re: non-equality joins Hi Keith, Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references? Thanks and Regards, Mahsa On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: https://cwiki.apache.org/Hive/languagemanual-joins.html"Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job." I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework. That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify. Cheers! On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > Hello, > > Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented? > > Thank you in advance for your response, > Mahsa ________________________________________________________________________________ Keith Wiley [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> keithwiley.com< http://keithwiley.com> music.keithwiley.com< http://music.keithwiley.com>"It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow." -- Keith Wiley ________________________________________________________________________________
+
Tucker, Matt 2012-03-13, 17:13
Keith Wiley 2012-03-13, 17:28
Sounds like Matt possesses the proper combination of expertise in both databases and MapReduce to assist you. I'm bowing out as I honestly don't know advanced database concepts at all. In addition, hive offers hive-specific tools like Matt suggested (map-side joins) to help out, which I'm too new too to speculate on. I'm just starting hive this week as a matter of fact.
The short answer on MapReduce algorithms is that the individual computational units can't communicate with each other (each mapper or each map() in fact cannot communicate with the others, likewise for reducers). That's one of the major distinctions between MapReduce and more general parallel processing frameworks like MPI. This is the wrong mailing list to go much deeper than that however.
Thanks Matt.
Best of luck Mahsa.
On Mar 13, 2012, at 10:13 , Tucker, Matt wrote:
> For theta joins, you’ll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it’s passed to the reducers. > > Matt Tucker > > From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, March 13, 2012 1:02 PM > To: [EMAIL PROTECTED] > Subject: Re: non-equality joins > > > Hi Keith, > > Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references? > > Thanks and Regards, > Mahsa >
________________________________________________________________________________ Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter." -- Yoda ________________________________________________________________________________
+
Keith Wiley 2012-03-13, 17:28
Alan Gates 2012-03-17, 00:41
There are algorithms for doing general theta-joins in parallel. Search Google on "theta joins parallel database" and you will find some interesting references. I am not aware of any tools that implement these yet. You can also do it via a cross join followed by a filter, but again you need special algorithms to do a cross in MapReduce, which Hive doesn't implement yet. See http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html (search for the section on Cross) for a discussion of how to do cross in MapReduce. Alan. On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote: > For theta joins, you’ll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it’s passed to the reducers. > > Matt Tucker > > From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, March 13, 2012 1:02 PM > To: [EMAIL PROTECTED] > Subject: Re: non-equality joins > > > Hi Keith, > > Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references? > > Thanks and Regards, > Mahsa > > > > On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]> wrote: > https://cwiki.apache.org/Hive/languagemanual-joins.html> > "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job." > > I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework. > > That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify. > > Cheers! > > On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > > > Hello, > > > > Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented? > > > > Thank you in advance for your response, > > Mahsa > > > ________________________________________________________________________________ > Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com > > "It's a fine line between meticulous and obsessive-compulsive and a slippery > rope between obsessive-compulsive and debilitatingly slow." > -- Keith Wiley > ________________________________________________________________________________ > >
+
Alan Gates 2012-03-17, 00:41
buddhika chamith 2012-03-17, 06:02
Hi, I think matt's solution is the way to go for now. If you need some basic understanding on how reduce and map side joins work see [1] whether if it helps you. Regards Buddhika [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote: > There are algorithms for doing general theta-joins in parallel. Search > Google on "theta joins parallel database" and you will find some > interesting references. I am not aware of any tools that implement these > yet. You can also do it via a cross join followed by a filter, but again > you need special algorithms to do a cross in MapReduce, which Hive doesn't > implement yet. See > http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(search for the section on Cross) for a discussion of how to do cross in > MapReduce. > > Alan. > > On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote: > > > For theta joins, you’ll have to convert the query to an equi-join, and > then filter for non-equality in the WHERE clause. Depending upon the size > of each table, you might consider looking at map-side joins, which will > allow for doing non-equality filters during a join before it’s passed to > the reducers. > > > > Matt Tucker > > > > From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, March 13, 2012 1:02 PM > > To: [EMAIL PROTECTED] > > Subject: Re: non-equality joins > > > > > > Hi Keith, > > > > Do you know exactly how an algorithm should be in order to fit in the > MapReduce framework? Could you refer me to some references? > > > > Thanks and Regards, > > Mahsa > > > > > > > > On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]> > wrote: > > https://cwiki.apache.org/Hive/languagemanual-joins.html> > > > "Hive does not support join conditions that are not equality conditions > as it is very difficult to express such conditions as a map/reduce job." > > > > I admit, that isn't a very detailed answer, but it gives some indication > of the reason for the discrepancy between Hive and other databases. Hive > fundamentally operates on Hadoop, namely on MapReduce (we all know this, > I'm just reiterating the train of thought). The problem is that certain > algorithms are exceedingly difficult to wedge into the MapReduce framework. > > > > That is as detailed as my personal insight can get. I've done a lot of > MapReduce programming in Hadoop but I'm not a database expert and I don't > really understand the steps involved in various kinds of table-joins, so I > don't understand the particular ways in which certain database operations > do or do not fit into MapReduce...but presumably nonequality joins > (whatever those are :-D ) are particularly difficult to MapReduceify. > > > > Cheers! > > > > On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > > > > > Hello, > > > > > > Is there a reason behind not implementing non-equality joins in Hive? > In other words, is there any usage for theta-join, if implemented? > > > > > > Thank you in advance for your response, > > > Mahsa > > > > > > > ________________________________________________________________________________ > > Keith Wiley [EMAIL PROTECTED] keithwiley.com > music.keithwiley.com > > > > "It's a fine line between meticulous and obsessive-compulsive and a > slippery > > rope between obsessive-compulsive and debilitatingly slow." > > -- Keith Wiley > > > ________________________________________________________________________________ > > > > > >
+
buddhika chamith 2012-03-17, 06:02
Edward Capriolo 2012-03-17, 16:04
in general hive does not offer features it can not do well. Cross joins on any data set where one table is not very small do not scale in map reduce. So there is not a big win for offering syntax for it. Not talking about pig but one very common unnamed map reduce framework offers Many features that do not paralize into map reduce. I find this framework a total 'tease'. On Saturday, March 17, 2012, buddhika chamith <[EMAIL PROTECTED]> wrote: > Hi, > > I think matt's solution is the way to go for now. If you need some basic understanding on how reduce and map side joins work see [1] whether if it helps you. > > Regards > Buddhika > > [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/> > On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote: >> >> There are algorithms for doing general theta-joins in parallel. Search Google on "theta joins parallel database" and you will find some interesting references. I am not aware of any tools that implement these yet. You can also do it via a cross join followed by a filter, but again you need special algorithms to do a cross in MapReduce, which Hive doesn't implement yet. See http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(searchfor the section on Cross) for a discussion of how to do cross in MapReduce. >> >> Alan. >> >> On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote: >> >> > For theta joins, you’ll have to convert the query to an equi-join, and then filter for non-equality in the WHERE clause. Depending upon the size of each table, you might consider looking at map-side joins, which will allow for doing non-equality filters during a join before it’s passed to the reducers. >> > >> > Matt Tucker >> > >> > From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] >> > Sent: Tuesday, March 13, 2012 1:02 PM >> > To: [EMAIL PROTECTED] >> > Subject: Re: non-equality joins >> > >> > >> > Hi Keith, >> > >> > Do you know exactly how an algorithm should be in order to fit in the MapReduce framework? Could you refer me to some references? >> > >> > Thanks and Regards, >> > Mahsa >> > >> > >> > >> > On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]> wrote: >> > https://cwiki.apache.org/Hive/languagemanual-joins.html>> > >> > "Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job." >> > >> > I admit, that isn't a very detailed answer, but it gives some indication of the reason for the discrepancy between Hive and other databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm just reiterating the train of thought). The problem is that certain algorithms are exceedingly difficult to wedge into the MapReduce framework. >> > >> > That is as detailed as my personal insight can get. I've done a lot of MapReduce programming in Hadoop but I'm not a database expert and I don't really understand the steps involved in various kinds of table-joins, so I don't understand the particular ways in which certain database operations do or do not fit into MapReduce...but presumably nonequality joins (whatever those are :-D ) are particularly difficult to MapReduceify. >> > >> > Cheers! >> > >> > On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: >> > >> > > Hello, >> > > >> > > Is there a reason behind not implementing non-equality joins in Hive? In other words, is there any usage for theta-join, if implemented? >> > > >> > > Thank you in advance for your response, >> > > Mahsa >> > >> > >> > ________________________________________________________________________________ >> > Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com >> > >> > "It's a fine line between meticulous and obsessive-compulsive and a slippery >> > rope between obsessive-compulsive and debilitatingly slow." >> > -- Keith Wiley >> > ________________________________________________________________________________
+
Edward Capriolo 2012-03-17, 16:04
mahsa mofidpoor 2012-03-17, 16:39
Hi Edward, Could you please clarify what you mean in your last paragraph? You found Pig Latin a week framework in terms of MapReduce? Thanks again for the response. Mahsa On Sat, Mar 17, 2012 at 12:04 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > in general hive does not offer features it can not do well. Cross joins on > any data set where one table is not very small do not scale in map reduce. > So there is not a big win for offering syntax for it. > > Not talking about pig but one very common unnamed map reduce framework > offers Many features that do not paralize into map reduce. I find this > framework a total 'tease'. > > On Saturday, March 17, 2012, buddhika chamith <[EMAIL PROTECTED]> > wrote: > > Hi, > > > > I think matt's solution is the way to go for now. If you need some basic > understanding on how reduce and map side joins work see [1] whether if it > helps you. > > > > Regards > > Buddhika > > > > [1] http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/> > > > On Sat, Mar 17, 2012 at 6:41 AM, Alan Gates <[EMAIL PROTECTED]> > wrote: > >> > >> There are algorithms for doing general theta-joins in parallel. Search > Google on "theta joins parallel database" and you will find some > interesting references. I am not aware of any tools that implement these > yet. You can also do it via a cross join followed by a filter, but again > you need special algorithms to do a cross in MapReduce, which Hive doesn't > implement yet. See > http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html(search for the section on Cross) for a discussion of how to do cross in > MapReduce. > >> > >> Alan. > >> > >> On Mar 13, 2012, at 10:13 AM, Tucker, Matt wrote: > >> > >> > For theta joins, you’ll have to convert the query to an equi-join, > and then filter for non-equality in the WHERE clause. Depending upon the > size of each table, you might consider looking at map-side joins, which > will allow for doing non-equality filters during a join before it’s passed > to the reducers. > >> > > >> > Matt Tucker > >> > > >> > From: mahsa mofidpoor [mailto:[EMAIL PROTECTED]] > >> > Sent: Tuesday, March 13, 2012 1:02 PM > >> > To: [EMAIL PROTECTED] > >> > Subject: Re: non-equality joins > >> > > >> > > >> > Hi Keith, > >> > > >> > Do you know exactly how an algorithm should be in order to fit in the > MapReduce framework? Could you refer me to some references? > >> > > >> > Thanks and Regards, > >> > Mahsa > >> > > >> > > >> > > >> > On Tue, Mar 13, 2012 at 12:49 PM, Keith Wiley <[EMAIL PROTECTED]> > wrote: > >> > https://cwiki.apache.org/Hive/languagemanual-joins.html> >> > > >> > "Hive does not support join conditions that are not equality > conditions as it is very difficult to express such conditions as a > map/reduce job." > >> > > >> > I admit, that isn't a very detailed answer, but it gives some > indication of the reason for the discrepancy between Hive and other > databases. Hive fundamentally operates on Hadoop, namely on MapReduce (we > all know this, I'm just reiterating the train of thought). The problem is > that certain algorithms are exceedingly difficult to wedge into the > MapReduce framework. > >> > > >> > That is as detailed as my personal insight can get. I've done a lot > of MapReduce programming in Hadoop but I'm not a database expert and I > don't really understand the steps involved in various kinds of table-joins, > so I don't understand the particular ways in which certain database > operations do or do not fit into MapReduce...but presumably nonequality > joins (whatever those are :-D ) are particularly difficult to MapReduceify. > >> > > >> > Cheers! > >> > > >> > On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote: > >> > > >> > > Hello, > >> > > > >> > > Is there a reason behind not implementing non-equality joins in > Hive? In other words, is there any usage for theta-join, if implemented? > >> > > > >> > > Thank you in advance for your response, > >
+
mahsa mofidpoor 2012-03-17, 16:39
|
|