HBase, mail # user - How to efficiently join HBase tables?
Re: How to efficiently join HBase tables?
Eran Kutner 2011-05-31, 12:43
MultipleInputs would be ideal, but that seems pretty complicated.
A MultiTableInputFormat seems like a simple change to the getSplits() method
of TableInputFormat, plus support for a collection of tables and their matching
scanners instead of a single table and scanner. That doesn't sound too
complicated.
Any other suggestions?
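
For illustration only, the reduce-side join described in the original
question quoted below can be sketched in plain Java, with no Hadoop or HBase
dependencies. Everything here is an assumption made for the example: the
TaggedRow class stands in for an HBase row tagged with its source table, the
table names "users" and "orders" are made up, and shuffle() only imitates the
grouping that the MapReduce framework performs between map and reduce. It is
not a MultiTableInputFormat implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Hypothetical stand-in for a row read from HBase, tagged with its source table.
    static final class TaggedRow {
        final String table;
        final String joinKey;
        final String value;

        TaggedRow(String table, String joinKey, String value) {
            this.table = table;
            this.joinKey = joinKey;
            this.value = value;
        }
    }

    // Imitates map + shuffle: rows from both tables are grouped by their join key.
    static Map<String, List<TaggedRow>> shuffle(List<TaggedRow> rows) {
        Map<String, List<TaggedRow>> grouped = new TreeMap<>();
        for (TaggedRow row : rows) {
            grouped.computeIfAbsent(row.joinKey, k -> new ArrayList<>()).add(row);
        }
        return grouped;
    }

    // Imitates reduce: for each key, pair every left-table row with every
    // right-table row (an inner join). Keys present in only one table emit nothing.
    static List<String> join(List<TaggedRow> rows, String leftTable, String rightTable) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<TaggedRow>> entry : shuffle(rows).entrySet()) {
            for (TaggedRow left : entry.getValue()) {
                if (!left.table.equals(leftTable)) {
                    continue;
                }
                for (TaggedRow right : entry.getValue()) {
                    if (right.table.equals(rightTable)) {
                        out.add(entry.getKey() + ":" + left.value + "," + right.value);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedRow> rows = Arrays.asList(
            new TaggedRow("users", "u1", "alice"),
            new TaggedRow("orders", "u1", "book"),
            new TaggedRow("orders", "u1", "pen"),
            new TaggedRow("users", "u2", "bob"),
            new TaggedRow("orders", "u3", "lamp"));
        // Only u1 appears in both tables, so only u1 produces joined rows.
        System.out.println(join(rows, "users", "orders"));
    }
}
```

In a real job the tag would have to come from whichever table the input split
belongs to, which is exactly the piece a multi-table input format would need
to supply.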

-eran

On Tue, May 31, 2011 at 15:31, Ferdy Galema <[EMAIL PROTECTED]> wrote:

> As far as I can tell there is not yet a built-in mechanism you can use for
> this. You could implement your own InputFormat, something like
> MultiTableInputFormat. If you need different map functions for the two
> tables, perhaps something similar to Hadoop's MultipleInputs should do the
> trick.
>
>
> On 05/31/2011 02:06 PM, Eran Kutner wrote:
>
>> Hi,
>> I need to join two HBase tables. The obvious way is to use an M/R job for
>> that. The problem is that the few references to that question I found
>> recommend pulling one table into the mapper and then doing a lookup for
>> the referred row in the second table.
>> This sounds like a very inefficient way to do a join with map reduce. I
>> believe it would be much better to feed the rows of both tables to the
>> mapper and let it emit a key based on the join fields. Since all the rows
>> with the same join field values will have the same key, the reducer will
>> be able to easily generate the result of the join.
>> The problem with this is that I couldn't find a way to feed two tables to
>> a single map reduce job. I could probably dump the tables to files in a
>> single directory and then run the join on the files, but that really
>> makes no sense.
>>
>> Am I missing something? Any other ideas?
>>
>> -eran
>>
>>