HBase >> mail # user >> How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
For my needs I don't really need the general case, but even if I did, I
think it could probably be done more simply.
The main problem is getting the data from both tables into the same MR job
without resorting to lookups. Without the theoretical
MultiTableInputFormat, I could just copy all the data from both tables into
a temp table, appending the source table name to the row keys to make sure
there are no conflicts. Once all the data from both tables is in the same
temp table, run an MR job. For each row, the mapper should emit a key
composed of the values of all the join fields in that row (the value can be
emitted as is). This causes all the rows from both tables that have the same
join field values to arrive at the reducer together. The reducer can then
iterate over them and produce the Cartesian product as needed.
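
A minimal sketch of the flow described above, written in plain Python rather
than as a real Hadoop job. The tables, the `city` join field, and every
function name here are hypothetical stand-ins; the shuffle is simulated with a
dictionary, and the table name is attached to the row key as a prefix.

```python
from collections import defaultdict
from itertools import product

# Hypothetical in-memory stand-ins for two HBase tables: row key -> columns.
users = {"u1": {"city": "Haifa", "name": "Ann"},
         "u2": {"city": "Oslo", "name": "Bo"}}
orders = {"o1": {"city": "Haifa", "item": "book"},
          "o2": {"city": "Haifa", "item": "pen"}}

JOIN_FIELDS = ("city",)  # assumed join columns, present in both tables


def build_temp_table(tables):
    """Copy every row into one table, tagging row keys with the source name."""
    temp = {}
    for table_name, rows in tables.items():
        for row_key, columns in rows.items():
            temp[f"{table_name}:{row_key}"] = columns
    return temp


def map_row(temp_key, columns):
    """Emit (join key, value); the value is the row plus its source table."""
    join_key = tuple(columns[field] for field in JOIN_FIELDS)
    source = temp_key.split(":", 1)[0]
    return join_key, (source, columns)


def reduce_join(values, left, right):
    """Build the Cartesian product of left and right rows sharing a join key."""
    left_rows = [cols for source, cols in values if source == left]
    right_rows = [cols for source, cols in values if source == right]
    return [{**l, **r} for l, r in product(left_rows, right_rows)]


def run_join(tables, left, right):
    temp = build_temp_table(tables)
    shuffled = defaultdict(list)  # simulates the shuffle: group by join key
    for temp_key, columns in temp.items():
        join_key, value = map_row(temp_key, columns)
        shuffled[join_key].append(value)
    joined = []
    for values in shuffled.values():
        joined.extend(reduce_join(values, left, right))
    return joined


joined = run_join({"users": users, "orders": orders}, "users", "orders")
```

Here `joined` holds one merged row per matching (user, order) pair, and a row
whose join key appears in only one table produces nothing, as in an inner join.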

I still don't like having to copy all the data into a temp table just
because I can't feed two tables into the MR job.

As Jason Rutherglen mentioned above, Hive can do joins. I don't know if it
can do them for HBase, and it will not suit my needs, but it would be
interesting to know how it does them, if anyone knows.

-eran

On Tue, May 31, 2011 at 22:02, Ted Dunning <[EMAIL PROTECTED]> wrote:

> The Cartesian product often makes an honest-to-god join not such a good
> idea on large data.  The common alternative is co-group, which is
> basically like doing the hard work of the join, but involves stopping
> just before emitting the Cartesian product.  This allows you to inject
> whatever cleverness you need at this point.
>
> Common kinds of cleverness include down-sampling of problematically large
> sets of candidates.
>
> On Tue, May 31, 2011 at 11:56 AM, Michael Segel
> <[EMAIL PROTECTED]> wrote:
>
> > So the underlying problem that the OP was trying to solve was how to join
> > two tables from HBase.
> > Unfortunately I goofed.
> > I gave a quick and dirty solution that is a bit incomplete. The row key
> > in the temp table has to be unique and I forgot about the Cartesian
> > product. So my solution wouldn't work in the general case.
> >
>
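
The co-group idea in Ted Dunning's quoted message can be sketched in plain
Python as below. This is an illustration only, not how any particular
framework implements it: the function names, the `limit` parameter, and the
choice of random down-sampling are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cogroup(left_rows, right_rows, key_fn):
    """Group both inputs by key, keeping each side's rows in its own list.

    This does the grouping work of a join but stops before building the
    Cartesian product, so the caller can decide what to do per key.
    """
    groups = defaultdict(lambda: ([], []))
    for row in left_rows:
        groups[key_fn(row)][0].append(row)
    for row in right_rows:
        groups[key_fn(row)][1].append(row)
    return dict(groups)


def downsample(rows, limit, rng):
    """Return at most `limit` rows, chosen at random when there are more."""
    if len(rows) <= limit:
        return list(rows)
    return rng.sample(rows, limit)


def join_with_cap(left_rows, right_rows, key_fn, limit, seed=0):
    """Join after co-grouping, capping each side at `limit` rows per key."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pairs = []
    for lefts, rights in cogroup(left_rows, right_rows, key_fn).values():
        lefts = downsample(lefts, limit, rng)
        rights = downsample(rights, limit, rng)
        pairs.extend((l, r) for l in lefts for r in rights)
    return pairs


# Hypothetical data: one "hot" key with 100 rows per side, one small key.
left = [("hot", i) for i in range(100)] + [("cold", 0)]
right = [("hot", i) for i in range(100)] + [("cold", 1)]
pairs = join_with_cap(left, right, key_fn=lambda row: row[0], limit=5)
```

Without the cap, the "hot" key alone would yield 100 x 100 pairs; with
`limit=5` it yields at most 5 x 5, while the small key is left untouched.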