HBase >> mail # user >> How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Exactly!
Thanks Dave for a much better explanation than mine!

-eran

On Thu, Jun 9, 2011 at 00:35, Dave Latham <[EMAIL PROTECTED]> wrote:

> I believe this is what Eran is suggesting:
>
> Table A
> -------
> Row1 (has joinVal_1)
> Row2 (has joinVal_2)
> Row3 (has joinVal_1)
>
> Table B
> -------
> Row4 (has joinVal_1)
> Row5 (has joinVal_3)
> Row6 (has joinVal_2)
>
> Mapper receives a list of input rows (union of both input tables in any
> order), and produces (=>) intermediate key, value pairs, where the key is
> the join field, and the value is whatever portion of the row you want
> available in your output.
>
> Map
> ----------
> A, Row1 => (joinVal_1, [A,Row1])
> A, Row2 => (joinVal_2, [A,Row2])
> A, Row3 => (joinVal_1, [A,Row3])
> B, Row4 => (joinVal_1, [B,Row4])
> B, Row5 => (joinVal_3, [B,Row5])
> B, Row6 => (joinVal_2, [B,Row6])
>
> The shuffle phase partitions and sorts by the map output key (which is the
> join value). The reduce phase then gets a key for the join value and a list
> of values containing all of the input rows (from either table) with that
> join value. It can then perform whatever operations you want (like
> enumerating the subset of the Cartesian product for that join value).
>
> Reduce
> ------------
> joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
> joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
> joinVal_3, {[B,Row5]} => {}
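For concreteness, the whole flow above can be simulated in a few lines of Python. The table contents mirror the toy example; in a real job the three phases would run as Hadoop map and reduce tasks over HBase scans, not in one process.

```python
from collections import defaultdict
from itertools import product

# Toy tables from the example: row key -> join value.
table_a = {"Row1": "joinVal_1", "Row2": "joinVal_2", "Row3": "joinVal_1"}
table_b = {"Row4": "joinVal_1", "Row5": "joinVal_3", "Row6": "joinVal_2"}

def map_phase(tables):
    """Emit (join_value, (table_tag, row_key)) for every input row."""
    for tag, table in tables.items():
        for row_key, join_val in table.items():
            yield join_val, (tag, row_key)

def shuffle(pairs):
    """Group map output by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each join value, emit the cross product of A-rows and B-rows."""
    joined = []
    for join_val, values in groups.items():
        a_rows = [r for tag, r in values if tag == "A"]
        b_rows = [r for tag, r in values if tag == "B"]
        joined.extend(product(a_rows, b_rows))  # empty if either side has no match
    return sorted(joined)

result = reduce_phase(shuffle(map_phase({"A": table_a, "B": table_b})))
# joinVal_3 exists only in table B, so Row5 joins with nothing.
```

Note that `result` comes out as the three pairs listed in the Reduce example above: Row1 x Row4, Row3 x Row4, and Row2 x Row6.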
>
> This sounds effective to me, so long as you can perform any desired
> operations on all the rows matching a single join value via a single
> iteration through the stream of reduce input values (for example, if the
> set
> of data for each join value fits in memory).  Otherwise you'd need to put
> the list of matches from table A some place that you can iterate over it
> again for each match in table B.
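The single-iteration caveat above can be eased by ordering each reduce group so that one table's rows arrive first (Hadoop can arrange this with a secondary sort on a table tag). Then only that side needs buffering while the other streams through. A sketch of such a reducer for one join key, with made-up names rather than the actual Hadoop API:

```python
def join_one_key(values):
    """values: iterable of (tag, row) pairs for one join key,
    with all 'A' entries guaranteed to arrive before 'B' entries."""
    a_buffer = []           # only table A's matches are held in memory
    out = []
    for tag, row in values:
        if tag == "A":
            a_buffer.append(row)
        else:               # a B row: pair it with every buffered A row
            out.extend((a, row) for a in a_buffer)
    return out

# joinVal_1 group from the toy example, A rows first:
pairs = join_one_key([("A", "Row1"), ("A", "Row3"), ("B", "Row4")])
```

If even the A side is too large for memory, the buffer would have to spill to local disk, as Dave notes.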
>
> Dave
>
> On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <[EMAIL PROTECTED]> wrote:
>
> > Let's make a toy example to see if we can capture all of the edge
> > conditions:
> > Table A
> > -------
> > Key1 joinVal_1
> > Key2 joinVal_2
> > Key3 joinVal_1
> >
> > Table B
> > -------
> > Key4 joinVal_1
> > Key5 joinVal_3
> > Key6 joinVal_2
> >
> > Now, assume that we have a mapper that takes two values, one row from A,
> > and one row from B.  Are you suggesting that we get the following map
> > calls:
> > Key1 & key4
> > Key2 & key5
> > Key3 & key6
> >
> > Or are you suggesting we get the following:
> > Key1 & key4
> > Key1 & key5
> > Key1 & key6
> > Key2 & key4
> > Key2 & key5
> > Key2 & key6
> > Key3 & key4
> > Key3 & key5
> > Key3 & key6
> >
> > Or are you suggesting something different?
> >
> > Dave
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eran
> > Kutner
> > Sent: Wednesday, June 08, 2011 11:47 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: How to efficiently join HBase tables?
> >
> > I'd like to clarify, again, what I'm trying to do and why I still think
> > it's the best way to do it.
> > I want to join two large tables. I'm assuming, and this is the key to
> > the efficiency of this method, that: 1) I'm getting a lot of data from
> > table A, something which is close enough to a full table scan, and
> > 2) this implies that I will need to join with most of table B as well.
> > All the suggestions from the SQL world do lookups, one way or another,
> > in table B. My suggestion is to use the power of the shuffle phase to
> > do the join. It is obviously doable, so I don't understand the
> > statement that it can't be done.
> > So to go over it again:
> > 1. You feed all the rows from tables A and B into the mapper.
> > 2. For each row, the mapper should output a new row with a key
> > constructed from the join fields and a value which is the row itself
> > (same as the input value it got).
> > 3. The shuffle phase will make sure all rows with the same values in
> > the join fields will end up together.
> > 4. The reducer will get all the rows for a single set of join field
> > values and can perform the actual join.
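Step 2 above, building the map output key from the join fields, can be sketched like this. The field names and the tagged-value layout are my own illustration, not anything specified in the thread:

```python
def map_row(table_name, row_key, row):
    """Emit (join_key, tagged_row) for one input row (here a plain dict).
    The join key concatenates the join fields so the shuffle phase
    (step 3) groups matching rows from both tables together."""
    join_fields = ("country", "city")  # hypothetical join fields
    join_key = "|".join(str(row[f]) for f in join_fields)
    # Tag the value with its source table so the reducer can tell
    # A rows from B rows when it performs the join (step 4).
    return join_key, (table_name, row_key, row)

key, value = map_row("A", "Row1", {"country": "US", "city": "NYC", "x": 1})
```

The separator choice matters in practice: it must not occur inside the field values, or two different field combinations could collide on the same key.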