Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> How to efficiently join HBase tables?


Copy link to this message
-
Re: How to efficiently join HBase tables?
I believe this is what Eran is suggesting:

Table A
-------
Row1 (has joinVal_1)
Row2 (has joinVal_2)
Row3 (has joinVal_1)

Table B
-------
Row4 (has joinVal_1)
Row5 (has joinVal_3)
Row6 (has joinVal_2)

Mapper receives a list of input rows (union of both input tables in any
order), and produces (=>) intermediate key, value pairs, where the key is
the join field, and the value is whatever portion of the row you want
available in your output

Map
----------
A, Row1 => (joinVal_1, [A,Row1])
A, Row2 => (joinVal_2, [A,Row2])
A, Row3 => (joinVal_1, [A,Row3])
B, Row4 => (joinVal_1, [B,Row4])
B, Row5 => (joinVal_3, [B,Row5])
B, Row6 => (joinVal_2, [B,Row6])

Shuffle phase partitions and sorts by the map output key (which is the join
value)
The Reduce phase then gets a key for the join value and a list of values
containing all of the input rows (from either table) with that join value.
It can then perform whatever operations you want (like enumerate the subset
of the Cartesian product for that join value)

Reduce
------------
joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
joinVal_3, {[A,Row3]} => {}

This sounds effective to me, so long as you can perform any desired
operations on the all the rows matching a single join value via a single
iteration through the stream of reduce input values (for example, if the set
of data for each join value fits in memory).  Otherwise you'd need to put
the list of matches from table A some place that you can iterate over it
again for each match in table B.

Dave

On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <[EMAIL PROTECTED]> wrote:

> Let's make a toy example to see if we can capture all of the edge
> conditions:
> Table A
> -------
> Key1 joinVal_1
> Key2 joinVal_2
> Key3 joinVal_1
>
> Table B
> -------
> Key4 joinVal_1
> Key5 joinVal_3
> Key6 joinVal_2
>
> Now, assume that we have a mapper that takes two values, one row from A,
> and one row from B.  Are you suggesting that we get the following map calls:
> Key1 & key4
> Key2 & key5
> Key3 & key6
>
> Or are you suggesting we get the following:
> Key1 & key4
> Key1 & key5
> Key1 & key6
> Key2 & key4
> Key2 & key5
> Key2 & key6
> Key3 & key4
> Key3 & key5
> Key3 & key6
>
> Or are you suggesting something different?
>
> Dave
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eran
> Kutner
> Sent: Wednesday, June 08, 2011 11:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: How to efficiently join HBase tables?
>
> I'd like to clarify, again what I'm trying to do and why I still think it's
> the best way to do it.
> I want to join two large tables, I'm assuming, and this is the key to the
> efficiency of this method, that: 1) I'm getting a lot of data from table A,
> something which is close enough top a full table scan, and 2) this implies
> that I will need to join with most of table B as well.
> All the suggestions from the SQL world are doing lookups, one way or
> another
> in table B. My suggestion is to use the power of the shuffle phase to do
> the
> join. It is obviously doable, so I don't understand the statement that it
> can't be done.
> So to go over it again:
> 1. You feed all the rows from table A and B into the mapper.
> 2. For each row, the mapper should output a new row with a key constructed
> from the join fields and a value which is the row itself (same as the input
> value it got).
> 3. The shuffle phase will make sure all rows with the same values in the
> join fields will end up together.
> 4. The reducer will get all the rows for a single set of join field values
> together and perform the actual join. The reducer can be programmed to do
> an
> inner or outer join at this point.
>
> I can't prove it without actually writing and testing it but I have a
> strong
> feeling this will be much more efficient for large joins than any form of
> lookup.
>
> -eran
>
>
>
> On Wed, Jun 8, 2011 at 16:01, Doug Meil <[EMAIL PROTECTED]