HBase, mail # user - How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Michel Segel 2011-06-09, 12:09
>> This sounds effective to me, so long as you can perform any desired
>> operations on all the rows matching a single join value via a single
>> iteration through the stream of reduce input values (for example, if the
>> set
>> of data for each join value fits in memory).  Otherwise you'd need to put
>> the list of matches from table A some place that you can iterate over it
>> again for each match in table B.
This is why I was suggesting using the temp tables and not trying to do it as a single map-reduce job. When your data sets get very large you will have problems...

;-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 9, 2011, at 4:35 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:

> Exactly!
> Thanks Dave for a much better explanation than mine!
>
> -eran
>
>
>
> On Thu, Jun 9, 2011 at 00:35, Dave Latham <[EMAIL PROTECTED]> wrote:
>
>> I believe this is what Eran is suggesting:
>>
>> Table A
>> -------
>> Row1 (has joinVal_1)
>> Row2 (has joinVal_2)
>> Row3 (has joinVal_1)
>>
>> Table B
>> -------
>> Row4 (has joinVal_1)
>> Row5 (has joinVal_3)
>> Row6 (has joinVal_2)
>>
>> Mapper receives a list of input rows (union of both input tables in any
>> order), and produces (=>) intermediate key, value pairs, where the key is
>> the join field, and the value is whatever portion of the row you want
>> available in your output
>>
>> Map
>> ----------
>> A, Row1 => (joinVal_1, [A,Row1])
>> A, Row2 => (joinVal_2, [A,Row2])
>> A, Row3 => (joinVal_1, [A,Row3])
>> B, Row4 => (joinVal_1, [B,Row4])
>> B, Row5 => (joinVal_3, [B,Row5])
>> B, Row6 => (joinVal_2, [B,Row6])
>>
>> The shuffle phase partitions and sorts by the map output key (which is the
>> join value).
>> The Reduce phase then gets a key for the join value and a list of values
>> containing all of the input rows (from either table) with that join value.
>> It can then perform whatever operations you want (like enumerating the
>> subset of the Cartesian product for that join value).
>>
>> Reduce
>> ------------
>> joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
>> joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
>> joinVal_3, {[B,Row5]} => {}
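
[The map/shuffle/reduce walkthrough above can be sketched end-to-end in plain Python. This is a toy simulation of the dataflow, not Hadoop code; the tuple layout and table tags are illustrative assumptions mirroring the example.]

```python
from itertools import groupby
from operator import itemgetter

# Toy input: (table, row_id, join_value), mirroring Tables A and B above.
rows = [
    ("A", "Row1", "joinVal_1"),
    ("A", "Row2", "joinVal_2"),
    ("A", "Row3", "joinVal_1"),
    ("B", "Row4", "joinVal_1"),
    ("B", "Row5", "joinVal_3"),
    ("B", "Row6", "joinVal_2"),
]

# Map: emit (join_value, (table, row_id)) for every input row.
mapped = [(join_val, (table, row_id)) for table, row_id, join_val in rows]

# Shuffle: sort by join value, with the table tag as a secondary key so all
# "A" rows for a key arrive before its "B" rows. That way the reducer only
# has to buffer the A side, then stream the B side.
mapped.sort()

# Reduce: for each join value, cross the buffered A rows with the B rows.
# A key seen in only one table (joinVal_3 here) contributes nothing.
joined = []
for join_val, group in groupby(mapped, key=itemgetter(0)):
    a_rows, b_rows = [], []
    for _, (table, row_id) in group:
        (a_rows if table == "A" else b_rows).append(row_id)
    for a in a_rows:
        for b in b_rows:
            joined.append((join_val, a, b))

print(joined)  # Row1 x Row4, Row3 x Row4, Row2 x Row6 — matching the Reduce table above
```

[In a real Hadoop job the A-before-B ordering within each key is typically arranged with a composite key and a grouping comparator (the "secondary sort" pattern), which keeps the reducer's memory footprint bounded by the smaller table's rows per key.]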
>>
>>>> This sounds effective to me, so long as you can perform any desired
>>>> operations on all the rows matching a single join value via a single
>>>> iteration through the stream of reduce input values (for example, if the
>>>> set
>>>> of data for each join value fits in memory).  Otherwise you'd need to put
>>>> the list of matches from table A some place that you can iterate over it
>>>> again for each match in table B.
>>
>> Dave
>>
>> On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <[EMAIL PROTECTED]> wrote:
>>
>>> Let's make a toy example to see if we can capture all of the edge
>>> conditions:
>>> Table A
>>> -------
>>> Key1 joinVal_1
>>> Key2 joinVal_2
>>> Key3 joinVal_1
>>>
>>> Table B
>>> -------
>>> Key4 joinVal_1
>>> Key5 joinVal_3
>>> Key6 joinVal_2
>>>
>>> Now, assume that we have a mapper that takes two values, one row from A,
>>> and one row from B.  Are you suggesting that we get the following map
>>> calls:
>>> Key1 & key4
>>> Key2 & key5
>>> Key3 & key6
>>>
>>> Or are you suggesting we get the following:
>>> Key1 & key4
>>> Key1 & key5
>>> Key1 & key6
>>> Key2 & key4
>>> Key2 & key5
>>> Key2 & key6
>>> Key3 & key4
>>> Key3 & key5
>>> Key3 & key6
>>>
>>> Or are you suggesting something different?
>>>
>>> Dave
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eran
>>> Kutner
>>> Sent: Wednesday, June 08, 2011 11:47 AM
>>> To: [EMAIL PROTECTED]
>>> Subject: Re: How to efficiently join HBase tables?
>>>
>>> I'd like to clarify, again, what I'm trying to do and why I still think
>>> it's the best way to do it.
>>> I want to join two large tables. I'm assuming, and this is the key to the
>>> efficiency of this method, that: 1) I'm getting a lot of data from table A,
>>> something which is close enough to a full table scan, and 2) this