HBase, mail # user - Re: How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Michel Segel 2011-06-08, 14:14
Unless I am mistaken... get() requires a row key, right?
And you can join tables on column data which isn't in the row key, right?

So how do you do a get()? :-)

Sure, there is more than one way to skin a cat. But if you want to be efficient... You will create a set of unique keys based on the columns that you want to join. Note that if you are going to use a temp table in HBase, you will want to store the unique key value A|B, and when you write the row to the temp table, you will append a unique identifier such as a UUID so that you don't lose the row.

Here your input list to the actual join is going to be the list of unique keys and then you do a scan to get the rows.

Again, I could be wrong but how can you perform a get() when you only know a portion of the row key?
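
The temp-table idea above can be sketched with a small in-memory model. This is only an illustration, not HBase client code: `SortedTable`, `temp_row_key`, and `scan_join_key` are hypothetical names, the table is a plain sorted list of string keys, and the join values are assumed not to contain the `|` separator. It shows why a UUID suffix keeps rows with the same A|B value from overwriting each other, and why a prefix scan (rather than a get) retrieves them.

```python
import bisect
import uuid


class SortedTable:
    """Toy stand-in for a table that keeps its rows sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        # A put on an existing row key overwrites the stored value.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        # A get needs the complete row key; a partial key finds nothing.
        return self._rows.get(row_key)

    def scan(self, start_row, stop_row):
        # Return rows with start_row <= key < stop_row, in key order.
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def temp_row_key(a, b):
    """Build 'A|B|<uuid>' so rows sharing a join value stay distinct."""
    return f"{a}|{b}|{uuid.uuid4()}"


def scan_join_key(table, a, b):
    """Fetch every temp-table row whose key starts with 'A|B|'."""
    prefix = f"{a}|{b}|"
    # Stop row: the prefix with its last character bumped by one,
    # which is the smallest string greater than every key with that prefix.
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return table.scan(prefix, stop)
```

For example, two rows written with `temp_row_key("x", "1")` both survive and both come back from `scan_join_key(table, "x", "1")`, while `table.get("x|1")` returns nothing because the full row key includes the UUID.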

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 8, 2011, at 8:01 AM, Doug Meil <[EMAIL PROTECTED]> wrote:

>
> Re: " With respect to Doug's posts, you can't do a multi-get off the bat"
>
> That's an assumption, but you're entitled to your opinion.
>
> -----Original Message-----
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 06, 2011 10:08 PM
> To: [EMAIL PROTECTED]
> Subject: RE: How to efficiently join HBase tables?
>
>
> Well....
>
> David is correct.
>
> Eran wanted to do a join, which is a relational concept that isn't natively supported by a NoSQL database. A better model would be a hierarchical model like Dick Pick's Revelation. (UniVerse aka U2 from Ardent/Informix/IBM/now Rocket Software?).
> And yes, we're looking back 40-some-odd years into either a merge/sort solution or how databases do a relational join. :-)
>
> Eran wants to do this in a single m/r job. The short answer is you can't. The longer answer is that if your main class implements Tool and is launched via ToolRunner, you can launch two jobs in parallel to get your subsets, and then when they both complete, you run the join job on them. So I guess it's a single 'job' or rather app. :-)
>
> With respect to Doug's posts, you can't do a multi-get off the bat because in the general case you're not fetching based on the row key but a column which is not part of the row key. (It could be a foreign key which would mean that at least one of your table fetches will be off the row key but you can't guarantee it.)
>
> So if you don't want to use temp tables, then you have to put your results in a sorted order, and you still want to get the unique set of the join-keys which means you have to run a reduce job. Then you can use the unique key set and then do the scans. (You can't do a multi-get because you're doing a scan with a start and stop row(s).)
>
> The reason I suggest using temp tables if you're going to do a join operation is that it makes your life easier, and it's probably faster too.
>
> Bottom line... I guess many data architects are going to need to rethink their data models when working on big data. :-)
>
> -Mike
>
> PS. If I get a spare moment, I may code this up...
>
>
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>> Date: Mon, 6 Jun 2011 17:19:44 -0400
>> Subject: RE: How to efficiently join HBase tables?
>>
>> Re:  " So, you all realize the joins have been talked about in the database community for 40 years?"
>>
>> Great point.  What's old is new!    :-)
>>
>> My suggestion from earlier in the thread was a variant of nested loops using multi-get in HTable, which would reduce the number of RPC calls.  So it's a "bulk-select nested loops" of sorts (i.e., as opposed to the 1-by-1 lookup of regular nested loops).
>>
>>
>> -----Original Message-----
>> From: Buttler, David [mailto:[EMAIL PROTECTED]]
>> Sent: Monday, June 06, 2011 4:30 PM
>> To: [EMAIL PROTECTED]
>> Subject: RE: How to efficiently join HBase tables?
>>
>> So, you all realize the joins have been talked about in the database community for 40 years?  There are two main types of joins:
>> Nested loops
>> Hash table
>>
>> Mike, in his various emails, seems to be trying to re-imagine how to implement both types of joins in HBase (which seems like a reasonable goal). I am not exactly sure what Eran is going for here, but it seems like Eran is glossing over a piece.  If you have two scanners for table A and B, then table B needs to be rescanned for every unique part of the join condition in table A.  There are certain ways of improving the efficiency of that: the two most obvious are pushing the selection criteria down to the scans, and scanning all of the same join values from table B at the same time (which requires that Table B's key is the join, or a secondary structure that stores the join values as the primary order).
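
The join strategies discussed in this thread can be compared with a minimal in-memory sketch. None of this is HBase API: tables are plain dicts, all function names are hypothetical, and `batched_lookup_join` only models the "bulk-select nested loops" idea under the assumption that the join value is table B's row key, which is the case the thread disagrees about.

```python
from collections import defaultdict


def nested_loop_join(table_a, table_b, join_col):
    """Plain nested loops: rescan all of table B for every row of table A."""
    out = []
    for a_key, a_row in table_a.items():
        for b_key, b_row in table_b.items():
            if a_row[join_col] == b_row[join_col]:
                out.append((a_key, b_key))
    return out


def hash_join(table_a, table_b, join_col):
    """Hash join: one pass over B to build an index, one pass over A to probe."""
    index = defaultdict(list)
    for b_key, b_row in table_b.items():
        index[b_row[join_col]].append(b_key)
    return [
        (a_key, b_key)
        for a_key, a_row in table_a.items()
        for b_key in index.get(a_row[join_col], [])
    ]


def batched_lookup_join(table_a, table_b_by_key, join_col, batch_size=2):
    """Batched lookups: fetch B rows for a whole batch of A rows at once.

    Assumes table_b_by_key is keyed by the join value itself. Returns the
    matched pairs and the number of simulated round trips.
    """
    out = []
    round_trips = 0
    items = list(table_a.items())
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        wanted = {row[join_col] for _, row in batch}
        round_trips += 1  # one simulated multi-row fetch per batch
        found = {k: table_b_by_key[k] for k in wanted if k in table_b_by_key}
        for a_key, a_row in batch:
            if a_row[join_col] in found:
                out.append((a_key, a_row[join_col]))
    return out, round_trips
```

The first two functions produce the same pairs; the difference is that the hash join reads table B once instead of once per row of table A. The batched variant reduces round trips but has nothing to look up when the join column is not part of table B's row key, which is the objection raised above.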