Re: How to efficiently join HBase tables?
Thanks everyone for the great feedback. I'll try to address all the
suggestions.

My data sets range from large to very large. One is on the order of many
billions of rows, although the input for a typical MR job will be in the
hundreds of millions; the second table is in the tens of millions. I doubt a
SQL DB will handle this kind of join in a reasonable manner.

Doing batched lookups will indeed be more efficient than one-by-one lookups,
but it will require the mapper to manage local state between multiple calls,
which is something I don't really like doing. Worse, it doesn't really
solve the lookup problem; it only moves it one tier lower. Instead of the
mapper having to do all those random lookups, HBase itself will now have to
do them. Granted, it is more efficient than individual lookup API calls, but
it is not nearly as efficient as doing sequential reads.
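
For illustration, here is a minimal sketch of the local state a batching
mapper would have to carry between calls. It is plain Python rather than the
HBase client API: `BatchingJoinMapper`, `multi_get` and `batch_size` are
hypothetical names, and `multi_get` is only a stand-in for whatever batched
Get the client offers. Note that the job must also call `flush()` once at the
end of the input, or the last partial batch is lost.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


class BatchingJoinMapper:
    """Buffers input rows and resolves their join keys one batch at a time."""

    def __init__(self, multi_get: Callable[[List[str]], Dict[str, Row]],
                 batch_size: int = 1000) -> None:
        # multi_get is a hypothetical stand-in for a batched lookup against
        # the second table: it takes row keys and returns the rows it found.
        self.multi_get = multi_get
        self.batch_size = batch_size
        self.buffer: List[Tuple[str, Row]] = []  # state kept between map() calls

    def map(self, join_key: str, row: Row) -> List[Row]:
        """Called once per input row; emits joined rows only when a batch fills."""
        self.buffer.append((join_key, row))
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return []

    def flush(self) -> List[Row]:
        """Looks up every buffered key in one call and emits the matches."""
        if not self.buffer:
            return []
        # De-duplicate and sort the keys so the lookup touches each key once.
        keys = sorted({key for key, _ in self.buffer})
        found = self.multi_get(keys)
        joined = [{**row, **found[key]} for key, row in self.buffer if key in found]
        self.buffer = []
        return joined
```

The lookups are still random reads on the second table; the sketch only
reduces the number of round trips.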

Finally, the temp table method: that will work, but again, I suspect it will
be a lot less efficient than the sequence files Hadoop would generate. The
join output is expected to be in the tens of millions of rows, each with
multiple columns. From some tests I've done, writing this number of rows to
a clean table starts out very slowly and takes a lot of time to ramp up as
the regions begin to split and move around the cluster. I should say that
the output of this join is just the input for another MR job, so it would
really be just a temp table and not something that would be useful after
that.
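
To make the temp table method from the quoted message below concrete, here is
a rough sketch in plain Python. A dict stands in for the HBase temp table, the
row key is column1|column2, and every qualifier is prefixed with its source
table name. `load_subset` and `joined_rows` are hypothetical helpers, and the
sketch assumes each join key occurs at most once per source table (a second
row with the same key would overwrite the first).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]
TempTable = Dict[str, Dict[str, str]]  # row key -> {qualifier: value}


def load_subset(temp: TempTable, table_name: str, rows: Iterable[Row],
                key_columns: List[str], keep: Callable[[Row], bool]) -> None:
    """Writes the filtered rows of one source table into the temp table."""
    for row in rows:
        if not keep(row):
            continue  # per-table filter, e.g. A.3 = xxx
        key = "|".join(row[column] for column in key_columns)
        for column, value in row.items():
            # Prefix with the table name so A and B columns cannot collide.
            temp[key][f"{table_name}|{column}"] = value


def joined_rows(temp: TempTable,
                table_names: List[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Scans the temp table in key order, keeping keys seen in every table."""
    result = []
    for key in sorted(temp):
        cells = temp[key]
        if all(any(q.startswith(name + "|") for q in cells) for name in table_names):
            result.append((key, cells))
    return result


# Tiny example: only the key "x|y" passes both filters and exists in both tables.
temp: TempTable = defaultdict(dict)
table_a = [{"1": "x", "2": "y", "3": "xxx", "v": "a1"},
           {"1": "p", "2": "q", "3": "other", "v": "a2"}]
table_b = [{"1": "x", "2": "y", "45": "zzz", "w": "b1"},
           {"1": "m", "2": "n", "45": "zzz", "w": "b2"}]
load_subset(temp, "A", table_a, ["1", "2"], lambda r: r["3"] == "xxx")
load_subset(temp, "B", table_b, ["1", "2"], lambda r: r["45"] == "zzz")
result = joined_rows(temp, ["A", "B"])
```

One detail the sketch makes visible: keys that appear in only one subset also
land in the temp table, so an inner join needs the extra check in
`joined_rows` when the table is read back.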

I should also say that I have looked into eliminating the lookup altogether
by resolving the data from the second table before the rows are inserted into
the main table, a kind of denormalization, but that would introduce an
unacceptable latency to a very high volume process.

Still looking for other ideas.

-eran

On Tue, May 31, 2011 at 18:42, Doug Meil <[EMAIL PROTECTED]> wrote:

> Eran's observation was that a join is solvable in a Mapper via lookups on a
> 2nd HBase table, but it might not be that efficient if the lookups are 1 by
> 1.  I agree with that.
>
> My suggestion was to use multi-Get for the lookups instead.  So you'd hold
> onto a batch of records in the Mapper, and when the batch is filled,
> you do the lookups (and then any required emitting, etc.).
>
>
>
> -----Original Message-----
> From: Michael Segel [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 31, 2011 10:56 AM
> To: [EMAIL PROTECTED]
> Subject: RE: How to efficiently join HBase tables?
>
>
> Maybe I'm missing something... but this isn't a hard problem to solve.
>
> Eran wants to join two tables.
> If we look at an SQL Statement...
>
> SELECT A.*, B.*
> FROM A, B
> WHERE A.1 = B.1
> AND  A.2 = B.2
> AND  A.3 = xxx
> AND A.4 = yyy
> AND B.45 = zzz
>
> Or something along those lines.
>
> So what you're essentially doing is saying I want to take a subset of data
> from table A, and a subset of data from table B and join them on the values
> in columns 1 and 2.
> Table A's data will be filtered on columns 3 and 4 and B's data will be
> filtered on column 45. NOTE: since you don't know the relationship of the
> column names to either table, you're safer in writing tableA|column_name and
> tableB|column_name to your temp table.
>
> So if you create a temp table FOO where the key is column 1 and column 2
> (column1|column2) then when you walk through the subsets adding them to the
> temp table, you will get the end result automatically.
>
> Then you can output your hbase temp table and then truncate the table.
>
> So what am I missing?
>
> -Mike
>
>
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > Date: Tue, 31 May 2011 10:22:34 -0400
> > Subject: RE: How to efficiently join HBase tables?
> >
> >
> > Re:  "The problem is that the few references to that question I found
> > recommend pulling one table to the mapper and then do a lookup for the
> > referred row in the second table."
> >
> > With multi-get in .90.x you could perform some reasonably clever
> > processing and not do the lookups one-by-one but in batches.