HBase >> mail # user >> How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
MultipleInputs would be ideal, but that seems pretty complicated.
MultiTableInputFormat seems like a simple change to the getSplits() method
of TableInputFormat, plus support for a collection of tables and their
matching scanners instead of a single table and scanner. That doesn't sound
too complicated.
Any other suggestions?
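
A rough sketch of the getSplits() idea, written as a plain-Python model and not as real HBase code: each table contributes its own per-region splits, and every split records which table and scanner it belongs to, so a record reader would know what to open. The names `TableSplit`, `splits_for_table`, and `multi_table_splits`, the table names, and the region boundaries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableSplit:
    table: str      # which table this split reads from
    start_row: str  # inclusive start key of the region
    end_row: str    # exclusive end key ("" stands for end of table)
    scan: str       # stand-in for the scanner configured for that table


def splits_for_table(table: str, regions: List[Tuple[str, str]], scan: str) -> List[TableSplit]:
    """One split per region, as a single-table input format would produce."""
    return [TableSplit(table, start, end, scan) for start, end in regions]


def multi_table_splits(tables: Dict[str, Tuple[List[Tuple[str, str]], str]]) -> List[TableSplit]:
    """Concatenate the per-table splits; `tables` maps name -> (regions, scan)."""
    splits: List[TableSplit] = []
    for name, (regions, scan) in tables.items():
        splits.extend(splits_for_table(name, regions, scan))
    return splits


# Hypothetical input: two tables, each with two regions and its own scanner.
tables = {
    "users": ([("", "m"), ("m", "")], "scan(cf=info)"),
    "orders": ([("", "5"), ("5", "")], "scan(cf=data)"),
}
all_splits = multi_table_splits(tables)
for s in all_splits:
    print(s.table, repr(s.start_row), repr(s.end_row), s.scan)
```

Because each split carries its table name, the mapper could tag every row with its source table, which is what a join in the reducer would need.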

-eran

On Tue, May 31, 2011 at 15:31, Ferdy Galema <[EMAIL PROTECTED]> wrote:

> As far as I can tell there is not yet a built-in mechanism you can use for
> this. You could implement your own InputFormat, something like
> MultiTableInputFormat. If you need different map functions for the two
> tables, perhaps something similar to Hadoop's MultipleInputs should do the
> trick.
>
>
> On 05/31/2011 02:06 PM, Eran Kutner wrote:
>
>> Hi,
>> I need to join two HBase tables. The obvious way is to use a M/R job for
>> that. The problem is that the few references to that question I found
>> recommend pulling one table to the mapper and then doing a lookup for the
>> referred row in the second table.
>> This sounds like a very inefficient way to do a join with map reduce. I
>> believe it would be much better to feed the rows of both tables to the
>> mapper and let it emit a key based on the join fields. Since all the rows
>> with the same join field values will have the same key, the reducer will
>> be able to easily generate the result of the join.
>> The problem with this is that I couldn't find a way to feed two tables to
>> a single map reduce job. I could probably dump the tables to files in a
>> single directory and then run the join on the files, but that really
>> makes no sense.
>>
>> Am I missing something? Any other ideas?
>>
>> -eran
>>
>>
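
The reduce-side join described in the original question can be sketched in a few lines. This is a minimal in-memory simulation in plain Python, not HBase or Hadoop code: the mapper tags each row with its source table and emits it under the join key, a shuffle step groups values by key, and the reducer pairs rows from the two sides. The table names, the `user_id` join field, the sample rows, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_row(table, row):
    """Emit (join_key, (table_tag, row)) so rows from both tables meet in one reducer."""
    return row["user_id"], (table, row)


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Inner join: pair every 'users' row with every 'orders' row for this key."""
    left = [r for tag, r in values if tag == "users"]
    right = [r for tag, r in values if tag == "orders"]
    return [{**l, **r} for l, r in product(left, right)]


# Hypothetical sample data standing in for the two tables.
users = [{"user_id": "u1", "name": "Ann"}, {"user_id": "u2", "name": "Bob"}]
orders = [
    {"user_id": "u1", "item": "book"},
    {"user_id": "u1", "item": "pen"},
    {"user_id": "u3", "item": "cup"},
]

mapped = [map_row("users", r) for r in users] + [map_row("orders", r) for r in orders]
joined = []
for key, values in sorted(shuffle(mapped).items()):
    joined.extend(reduce_join(key, values))

for row in joined:
    print(row)
```

Keys that appear in only one table (`u2` and `u3` here) produce no output, which is inner-join behavior; an outer join would emit the unmatched side as well. Unlike the lookup-per-row approach, each table is read once and the matching happens in the reducer.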