HBase >> mail # user >> How to efficiently join HBase tables?

Re: How to efficiently join HBase tables?
I'd like to clarify again what I'm trying to do and why I still think it's
the best way to do it.
I want to join two large tables. I'm assuming, and this is the key to the
efficiency of this method, that: 1) I'm getting a lot of data from table A,
something close to a full table scan, and 2) this implies
that I will need to join with most of table B as well.
All the suggestions from the SQL world do lookups, one way or another,
in table B. My suggestion is to use the power of the shuffle phase to do the
join. It is obviously doable, so I don't understand the statement that it
can't be done.
So to go over it again:
1. You feed all the rows from table A and B into the mapper.
2. For each row, the mapper should output a new row with a key constructed
from the join fields and a value which is the row itself (same as the input
value it got).
3. The shuffle phase will make sure all rows with the same values in the
join fields will end up together.
4. The reducer will get all the rows for a single set of join field values
together and perform the actual join. The reducer can be programmed to do an
inner or outer join at this point.
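
To make the four steps concrete, here is a minimal sketch in plain Python that only simulates the map, shuffle and reduce phases in memory. It is not HBase or Hadoop code. The sample rows, the field name user_id, and the "A"/"B" tags are made up for illustration. Tagging each value with its source table is an added assumption: the steps above pass the row through unchanged, but the reducer needs some way to tell which table a row came from.

```python
from collections import defaultdict

# Hypothetical sample rows; field names and values are illustrative only.
TABLE_A = [
    {"row": "a1", "user_id": 1, "name": "ann"},
    {"row": "a2", "user_id": 2, "name": "bob"},
    {"row": "a3", "user_id": 3, "name": "cat"},
]
TABLE_B = [
    {"row": "b1", "user_id": 1, "order": "book"},
    {"row": "b2", "user_id": 1, "order": "pen"},
    {"row": "b3", "user_id": 4, "order": "lamp"},
]


def map_row(table, row, join_fields):
    """Step 2: emit (join key, row), with the row tagged by source table."""
    key = tuple(row[f] for f in join_fields)
    return key, (table, row)


def shuffle(pairs):
    """Step 3: group all values that share the same join key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values, outer=False):
    """Step 4: pair every A row with every B row for one join key."""
    a_rows = [r for t, r in values if t == "A"]
    b_rows = [r for t, r in values if t == "B"]
    if outer:
        # Full outer join: keep unmatched rows, paired with None.
        a_rows = a_rows or [None]
        b_rows = b_rows or [None]
    return [(a, b) for a in a_rows for b in b_rows]


def join(table_a, table_b, join_fields, outer=False):
    # Step 1: feed all rows from both tables into the mapper.
    pairs = [map_row("A", r, join_fields) for r in table_a]
    pairs += [map_row("B", r, join_fields) for r in table_b]
    result = []
    for _key, values in sorted(shuffle(pairs).items()):
        result.extend(reduce_join(values, outer))
    return result


if __name__ == "__main__":
    for a, b in join(TABLE_A, TABLE_B, ["user_id"], outer=True):
        print(a, b)
```

Note that neither table is looked up by key at any point: each row is read once, and the grouping by join key does all the matching, which is the property the argument above relies on.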

I can't prove it without actually writing and testing it but I have a strong
feeling this will be much more efficient for large joins than any form of


On Wed, Jun 8, 2011 at 16:01, Doug Meil <[EMAIL PROTECTED]> wrote:

> Re: " With respect to Doug's posts, you can't do a multi-get off the bat"
> That's an assumption, but you're entitled to your opinion.
> -----Original Message-----
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 06, 2011 10:08 PM
> Subject: RE: How to efficiently join HBase tables?
> Well....
> David is correct.
> Eran wanted to do a join which is a relational concept that isn't natively
> supported by a NoSQL database. A better model would be a hierarchical model
> like Dick Pick's Revelation. (UniVerse aka U2 from Ardent/Informix/IBM/now
> Rocket Software?).
> And yes, we're looking back 40-some-odd years into either a merge/sort
> solution or how databases do a relational join. :-)
> Eran wants to do this in a single m/r job. The short answer is you can't.
> The longer answer is that if your main class implements Tool (and is run
> through ToolRunner), you can launch two jobs in parallel to get your
> subsets, and then when they both complete, you run the join job on them.
> So I guess it's a single 'job', or rather a single app. :-)
> With respect to Doug's posts, you can't do a multi-get off the bat because
> in the general case you're not fetching based on the row key but a column
> which is not part of the row key. (It could be a foreign key which would
> mean that at least one of your table fetches will be off the row key but you
> can't guarantee it.)
> So if you don't want to use temp tables, then you have to put your results
> in a sorted order, and you still want to get the unique set of the join-keys
> which means you have to run a reduce job. Then you can use the unique key
> set and then do the scans. (You can't do a multi-get because you're doing a
> scan with a start and stop row(s).)
> The reason I suggest using temp tables if you're going to do a join
> operation is that it makes your life easier and is probably faster
> too.
> Bottom line... I guess many data architects are going to need to rethink
> their data models when working on big data. :-)
> -Mike
> PS. If I get a spare moment, I may code this up...
> > Date: Mon, 6 Jun 2011 17:19:44 -0400
> > Subject: RE: How to efficiently join HBase tables?
> >
> > Re:  " So, you all realize the joins have been talked about in the
> database community for 40 years?"
> >
> > Great point.  What's old is new!    :-)
> >
> > My suggestion from earlier in the thread was a variant of nested loops by
> using multi-get in HTable, which would reduce the number of RPC calls.  So