HBase, mail # user - How to efficiently join HBase tables?


Re: How to efficiently join HBase tables?
Eran Kutner 2011-06-03, 07:23
Mike, this is more or less what I tried to describe in my initial post, only
you explained it much better.
The problem is that I want to do all of this in one M/R run, not 3, and
without explicit temp tables. If only there were a way to feed both table A
and table B into the same M/R job, it could be done.

Let's take your query and assumptions, for example.
So we configure scanner A to return rows where c=xxx and d=yyy
We then configure scanner B to return rows where e=zzz
Now we feed all those rows to the mapper.
For each row it receives, the mapper outputs a new key, "a|b", together with
the same value it received. If either a or b doesn't exist in the row, the
mapper doesn't output anything for that row.
There is an implicit "temp table" created at this stage by Hadoop.
Now the reducer is run. For every key "a|b" generated by the mapper it gets
one or more value sets, each one representing a row from one of the original
two tables. For simplicity, let's assume we got two rows, one from table A and
the other from table B. Now the reducer can combine the two rows and output
the combined row. This works just the same if there are multiple rows from
each table with the same "a|b" key; in that case the reducer has to generate
the Cartesian product of all the rows. Outer joins can also be done this way:
in an outer join, the reducer may get rows from only one table for a given
"a|b" key but still generates an output.

-eran
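
The single-pass join described above can be sketched in plain Python, with
no Hadoop or HBase involved. This is only an illustration under stated
assumptions: the function names (map_rows, shuffle, reduce_join, run_join)
are made up for the example, rows are plain dicts, the scanner filters are
simulated with generator expressions, and each mapper value is tagged with
its source table ("A" or "B") so the reducer can tell the two sides apart.
That tag is an addition that the message does not spell out.

```python
from collections import defaultdict
from itertools import product


def join_key(row):
    """Build the "a|b" join key, or return None if a or b is missing."""
    if "a" not in row or "b" not in row:
        return None
    return f"{row['a']}|{row['b']}"


def map_rows(table_name, rows):
    """Mapper: emit (join key, (source table, row)); skip rows lacking a or b."""
    for row in rows:
        key = join_key(row)
        if key is not None:
            # Assumption: tag the value with its table so the reducer can split sides.
            yield key, (table_name, row)


def shuffle(pairs):
    """Stand-in for Hadoop's shuffle (the implicit "temp table"): group by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values, outer=False):
    """Reducer: emit the Cartesian product of the A rows and B rows for one key."""
    a_rows = [row for table, row in values if table == "A"]
    b_rows = [row for table, row in values if table == "B"]
    if outer:
        # Outer join: a missing side is represented by one empty row.
        a_rows = a_rows or [{}]
        b_rows = b_rows or [{}]
    for a_row, b_row in product(a_rows, b_rows):
        yield {**a_row, **b_row}


def run_join(table_a, table_b, outer=False):
    """Feed both filtered 'scans' through one map/shuffle/reduce pass."""
    # Simulated scanner filters; "xxx", "yyy", "zzz" are the query's placeholders.
    scan_a = (r for r in table_a if r.get("c") == "xxx" and r.get("d") == "yyy")
    scan_b = (r for r in table_b if r.get("e") == "zzz")
    pairs = list(map_rows("A", scan_a)) + list(map_rows("B", scan_b))
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values, outer))
    return joined
```

With one matching row from table A and two from table B under the same
"a|b" key, run_join returns two combined rows. Passing outer=True also
emits a row for each key that appears on only one side.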

On Fri, Jun 3, 2011 at 00:05, Michael Segel <[EMAIL PROTECTED]> wrote:

>
> Not to beat a dead horse, but I thought a bit more about the problem.
> If you want to do this all in HBase using a M/R job...
>
> Let's define the following:
> SELECT *
> FROM A, B
> WHERE A.a = B.a
> AND     A.b = B.b
> AND     A.c = xxx
> AND     A.d = yyy
> AND     B.e = zzz
>
> Is the sample query.
>
> So our join key is "a|b" because we're matching on columns a and b. (The
> pipe is to delimit the columns, assuming the columns themselves don't
> contain pipes...)
>
> Our filters on A are c and d while e is the filter on B.
>
> So we want to do the following:
>
> M/R Map job 1 gets the subset from table A along with a set of unique keys.
> M/R Map job 2 gets the subset from table B along with a set of unique keys.
> M/R Map job 3 takes either set of unique keys as the input list and you
> split it based on the number of parallel mappers you want to use.
>
> You have a couple of options on how you want to proceed.
> In each Mapper.map() your input is a unique key.
> I guess you could create two scanners, one for tempTableA, and one for
> tempTableB.
> It looks like you can get the iterator for each result set, and then for
> each row in temp table A, you iterate through the result set from temp table
> B, writing out the joined set.
>
> The only problem is that your result set file isn't in sort order. So I
> guess you could take the output from this job and reduce it to get it into
> sort order.
>
> Option B. Using HDFS files for temp 'tables'.
> You can do this... but you would still have to track the unique keys and
> also sort both the keys and the files which will require a reduce job.
>
>
> Now this is just my opinion, but if I use HBase, I don't have to worry
> about using a reducer except to order the final output set.
> So I can save the time it takes to do the reduce step. So I have to ask...
> how much time is spent by HBase in splitting and compacting the temp tables?
> Also can't you pre-split the temp table before you use them?
>
> Or am I still missing something?
>
> Note: In this example, you'd have to write an input format that takes a
> java list object (or something similar) as your input and then you can split
> it to get it to run in parallel.
> Or you could just write this on the client and split the list up and run
> the join in parallel threads on the client node. Or a single thread which
> would mean that it would run and output in sort order.
>
> HTH
>
> -Mike
>
> > Date: Wed, 1 Jun 2011 07:47:30 -0700
> > Subject: Re: How to efficiently join HBase tables?