HBase >> mail # user >> Re: How to efficiently join HBase tables?


Florin P 2011-06-16, 12:44
Buttler, David 2011-06-17, 00:02
Eran Kutner 2011-05-31, 12:06
Ferdy Galema 2011-05-31, 12:31
Eran Kutner 2011-05-31, 12:43
Michael Segel 2011-05-31, 14:20
Doug Meil 2011-05-31, 14:22
Michael Segel 2011-05-31, 14:56
Doug Meil 2011-05-31, 15:42
Eran Kutner 2011-05-31, 18:42
Michael Segel 2011-05-31, 20:09
Michael Segel 2011-05-31, 18:56
Ted Dunning 2011-05-31, 19:02
Eran Kutner 2011-05-31, 19:19
Ted Dunning 2011-05-31, 20:10
Patrick Angeles 2011-05-31, 20:41
Jason Rutherglen 2011-06-01, 00:18
Bill Graham 2011-06-01, 00:35
Jason Rutherglen 2011-06-01, 00:41
Eran Kutner 2011-06-01, 10:50
Lars George 2011-06-01, 13:54
Jason Rutherglen 2011-06-01, 14:47
Michael Segel 2011-06-02, 21:05
Eran Kutner 2011-06-03, 07:23
Buttler, David 2011-06-06, 20:30
Doug Meil 2011-06-06, 21:19
Michael Segel 2011-06-07, 02:08
Doug Meil 2011-06-08, 13:01
Eran Kutner 2011-06-08, 18:47
Buttler, David 2011-06-08, 20:45
Dave Latham 2011-06-08, 21:35
Buttler, David 2011-06-08, 23:02
Eran Kutner 2011-06-09, 09:35
Re: How to efficiently join HBase tables?
>> This sounds effective to me, so long as you can perform any desired
>> operations on all the rows matching a single join value via a single
>> iteration through the stream of reduce input values (for example, if the
>> set
>> of data for each join value fits in memory).  Otherwise you'd need to put
>> the list of matches from table A some place that you can iterate over it
>> again for each match in table B.
This is why I was suggesting using the temp tables and not trying to do it as a single map-reduce job. When your data sets get very large you will have problems...

;-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 9, 2011, at 4:35 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:

> Exactly!
> Thanks Dave for a much better explanation than mine!
>
> -eran
>
>
>
> On Thu, Jun 9, 2011 at 00:35, Dave Latham <[EMAIL PROTECTED]> wrote:
>
>> I believe this is what Eran is suggesting:
>>
>> Table A
>> -------
>> Row1 (has joinVal_1)
>> Row2 (has joinVal_2)
>> Row3 (has joinVal_1)
>>
>> Table B
>> -------
>> Row4 (has joinVal_1)
>> Row5 (has joinVal_3)
>> Row6 (has joinVal_2)
>>
>> Mapper receives a list of input rows (union of both input tables in any
>> order), and produces (=>) intermediate key, value pairs, where the key is
>> the join field, and the value is whatever portion of the row you want
>> available in your output
>>
>> Map
>> ----------
>> A, Row1 => (joinVal_1, [A,Row1])
>> A, Row2 => (joinVal_2, [A,Row2])
>> A, Row3 => (joinVal_1, [A,Row3])
>> B, Row4 => (joinVal_1, [B,Row4])
>> B, Row5 => (joinVal_3, [B,Row5])
>> B, Row6 => (joinVal_2, [B,Row6])
>>
>> Shuffle phase partitions and sorts by the map output key (which is the join
>> value)
>> The Reduce phase then gets a key for the join value and a list of values
>> containing all of the input rows (from either table) with that join value.
>> It can then perform whatever operations you want (like enumerate the subset
>> of the Cartesian product for that join value)
>>
>> Reduce
>> ------------
>> joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
>> joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
>> joinVal_3, {[B,Row5]} => {}
>>
>>>> This sounds effective to me, so long as you can perform any desired
>>>> operations on all the rows matching a single join value via a single
>>>> iteration through the stream of reduce input values (for example, if the
>>>> set
>>>> of data for each join value fits in memory).  Otherwise you'd need to put
>>>> the list of matches from table A some place that you can iterate over it
>>>> again for each match in table B.
>>
>> Dave
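[Editorial note: the map/shuffle/reduce flow Dave describes can be sketched as a small standalone Python simulation. This is plain Python, not actual Hadoop or HBase code; the table data mirrors the toy example above, and all function names are illustrative.]

```python
from collections import defaultdict
from itertools import product

# Toy rows from the example: row key -> join value.
table_a = {"Row1": "joinVal_1", "Row2": "joinVal_2", "Row3": "joinVal_1"}
table_b = {"Row4": "joinVal_1", "Row5": "joinVal_3", "Row6": "joinVal_2"}

def map_phase(table_name, rows):
    # Emit (join value, [table, row key]) pairs, as in the Map step.
    for row_key, join_val in rows.items():
        yield join_val, (table_name, row_key)

def shuffle(pairs):
    # Partition and group the intermediate pairs by join value,
    # as the shuffle phase does with the map output key.
    groups = defaultdict(list)
    for join_val, tagged_row in pairs:
        groups[join_val].append(tagged_row)
    return groups

def reduce_phase(groups):
    # For each join value, enumerate the Cartesian product of the
    # A-side and B-side rows that share it (the Reduce step above).
    joined = []
    for join_val, tagged_rows in groups.items():
        a_rows = [key for table, key in tagged_rows if table == "A"]
        b_rows = [key for table, key in tagged_rows if table == "B"]
        joined.extend(product(a_rows, b_rows))
    return joined

pairs = list(map_phase("A", table_a)) + list(map_phase("B", table_b))
result = reduce_phase(shuffle(pairs))
# result pairs Row1 and Row3 with Row4, and Row2 with Row6;
# joinVal_3 has no match in table A, so Row5 joins nothing.
```

Note the caveat quoted above still applies: this sketch holds the per-key row lists in memory, which only works while the set of rows for each join value stays small enough to fit.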
>>
>> On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <[EMAIL PROTECTED]> wrote:
>>
>>> Let's make a toy example to see if we can capture all of the edge
>>> conditions:
>>> Table A
>>> -------
>>> Key1 joinVal_1
>>> Key2 joinVal_2
>>> Key3 joinVal_1
>>>
>>> Table B
>>> -------
>>> Key4 joinVal_1
>>> Key5 joinVal_3
>>> Key6 joinVal_2
>>>
>>> Now, assume that we have a mapper that takes two values, one row from A,
>>> and one row from B.  Are you suggesting that we get the following map
>> calls:
>>> Key1 & key4
>>> Key2 & key5
>>> Key3 & key6
>>>
>>> Or are you suggesting we get the following:
>>> Key1 & key4
>>> Key1 & key5
>>> Key1 & key6
>>> Key2 & key4
>>> Key2 & key5
>>> Key2 & key6
>>> Key3 & key4
>>> Key3 & key5
>>> Key3 & key6
>>>
>>> Or are you suggesting something different?
>>>
>>> Dave
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Eran
>>> Kutner
>>> Sent: Wednesday, June 08, 2011 11:47 AM
>>> To: [EMAIL PROTECTED]
>>> Subject: Re: How to efficiently join HBase tables?
>>>
>>> I'd like to clarify, again what I'm trying to do and why I still think
>> it's
>>> the best way to do it.
>>> I want to join two large tables, I'm assuming, and this is the key to the
>>> efficiency of this method, that: 1) I'm getting a lot of data from table
>> A,
>>> something which is close enough to a full table scan, and 2) this
Michel Segel 2011-06-08, 14:14
Doug Meil 2011-06-09, 02:56
Michel Segel 2011-06-09, 12:02
Doug Meil 2011-05-31, 19:39
Michael Segel 2011-05-31, 20:18
Jason Rutherglen 2011-05-31, 18:48