Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> How to efficiently join HBase tables?


Copy link to this message
-
Re: How to efficiently join HBase tables?
Exactly!
Thanks Dave for a much better explanation than mine!

-eran

On Thu, Jun 9, 2011 at 00:35, Dave Latham <[EMAIL PROTECTED]> wrote:

> I believe this is what Eran is suggesting:
>
> Table A
> -------
> Row1 (has joinVal_1)
> Row2 (has joinVal_2)
> Row3 (has joinVal_1)
>
> Table B
> -------
> Row4 (has joinVal_1)
> Row5 (has joinVal_3)
> Row6 (has joinVal_2)
>
> Mapper receives a list of input rows (union of both input tables in any
> order), and produces (=>) intermediate key, value pairs, where the key is
> the join field, and the value is whatever portion of the row you want
> available in your output
>
> Map
> ----------
> A, Row1 => (joinVal_1, [A,Row1])
> A, Row2 => (joinVal_2, [A,Row2])
> A, Row3 => (joinVal_1, [A,Row3])
> B, Row4 => (joinVal_1, [B,Row4])
> B, Row5 => (joinVal_3, [B,Row5])
> B, Row6 => (joinVal_2, [B,Row6])
>
> Shuffle phase partitions and sorts by the map output key (which is the join
> value)
> The Reduce phase then gets a key for the join value and a list of values
> containing all of the input rows (from either table) with that join value.
> It can then perform whatever operations you want (like enumerate the subset
> of the Cartesian product for that join value)
>
> Reduce
> ------------
> joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
> joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
> joinVal_3, {[A,Row3]} => {}
>
> This sounds effective to me, so long as you can perform any desired
> operations on the all the rows matching a single join value via a single
> iteration through the stream of reduce input values (for example, if the
> set
> of data for each join value fits in memory).  Otherwise you'd need to put
> the list of matches from table A some place that you can iterate over it
> again for each match in table B.
>
> Dave
>
> On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <[EMAIL PROTECTED]> wrote:
>
> > Let's make a toy example to see if we can capture all of the edge
> > conditions:
> > Table A
> > -------
> > Key1 joinVal_1
> > Key2 joinVal_2
> > Key3 joinVal_1
> >
> > Table B
> > -------
> > Key4 joinVal_1
> > Key5 joinVal_3
> > Key6 joinVal_2
> >
> > Now, assume that we have a mapper that takes two values, one row from A,
> > and one row from B.  Are you suggesting that we get the following map
> calls:
> > Key1 & key4
> > Key2 & key5
> > Key3 & key6
> >
> > Or are you suggesting we get the following:
> > Key1 & key4
> > Key1 & key5
> > Key1 & key6
> > Key2 & key4
> > Key2 & key5
> > Key2 & key6
> > Key3 & key4
> > Key3 & key5
> > Key3 & key6
> >
> > Or are you suggesting something different?
> >
> > Dave
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eran
> > Kutner
> > Sent: Wednesday, June 08, 2011 11:47 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: How to efficiently join HBase tables?
> >
> > I'd like to clarify, again what I'm trying to do and why I still think
> it's
> > the best way to do it.
> > I want to join two large tables, I'm assuming, and this is the key to the
> > efficiency of this method, that: 1) I'm getting a lot of data from table
> A,
> > something which is close enough top a full table scan, and 2) this
> implies
> > that I will need to join with most of table B as well.
> > All the suggestions from the SQL world are doing lookups, one way or
> > another
> > in table B. My suggestion is to use the power of the shuffle phase to do
> > the
> > join. It is obviously doable, so I don't understand the statement that it
> > can't be done.
> > So to go over it again:
> > 1. You feed all the rows from table A and B into the mapper.
> > 2. For each row, the mapper should output a new row with a key
> constructed
> > from the join fields and a value which is the row itself (same as the
> input
> > value it got).
> > 3. The shuffle phase will make sure all rows with the same values in the
> > join fields will end up together.
> > 4. The reducer will get all the rows for a single set of join field