Uhmm... not exactly. It depends on how you view HBase and your use case...
The short answer is that Sudheendra is basically correct: you really need to rethink using HBase if you're doing a lot of joins, because HBase is more of a persistent object store than a relational database. The longer answer is that even though HBase lacks the internals to handle joins efficiently, it can be made to do them.
OK... you have to remember that joins are expensive. If you don't have indexes, it's going to be a map/reduce problem.
If you have indexes, you can join against them by comparing the ordered key sets and taking their intersection, using inverted index tables plus a foreign-key index table.
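To make that intersection idea concrete, here's a minimal sketch in plain Java with no HBase dependency (the row keys are made up for illustration): since each secondary-index scan hands you row keys in sorted order, you can walk the two ordered sets in tandem and keep only the keys that appear in both.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexIntersect {
    // Merge-style intersection of two sorted lists of row keys, the kind
    // of ordered sets you'd get back from scanning two index tables.
    // Runs in O(n + m) with a single pass over each list.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {            // key is in both indexes: keep it
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {      // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> idx1 = List.of("row001", "row017", "row042", "row099");
        List<String> idx2 = List.of("row017", "row042", "row100");
        System.out.println(intersect(idx1, idx2)); // [row017, row042]
    }
}
```

The surviving keys are the only base-table rows you then need to fetch, which is what makes the index-side join cheaper than a full map/reduce pass.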
There are some issues that you have to work around...
1) A row can't exceed the size of a region. So you will need to work out how to split a row while still maintaining sort order.
2) You will probably want to launch the query from an Edge node (some call it a gateway node) which is on the same subnet as your cluster.
3) Such a solution will work when you want fast reads but can tolerate slower writes.
4) Coprocessors need to be tweaked a bit, and you will want to decouple the writes to the secondary index tables from the base-table write.
5) If you rely on your Hadoop Vendor to auto tune your cluster... you will have to make some manual tweaks.
6) This is not for the beginner or faint of heart.
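On point 1 above, one common trick for splitting a logical row while keeping sort order (a sketch, not HBase-specific code; the key layout and separator are my assumptions) is a fixed-width, zero-padded chunk suffix on the row key, so all chunks of one logical row stay adjacent and lexicographically ordered:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkedRowKeys {
    // Fixed-width, zero-padded suffix makes chunk keys sort in numeric
    // order under lexicographic byte ordering (as HBase orders row keys).
    // Assumes the base key is fixed-length or the separator is chosen so
    // one base key can't be a prefix collision of another.
    static String chunkKey(String baseKey, int chunk) {
        return String.format("%s|%06d", baseKey, chunk);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int c : new int[]{2, 0, 10, 1}) {
            keys.add(chunkKey("user123", c));
        }
        Collections.sort(keys); // lexicographic sort, same as HBase scan order
        System.out.println(keys);
        // chunks come back in numeric order: 000000, 000001, 000002, 000010
    }
}
```

Without the zero padding, chunk 10 would sort between chunk 1 and chunk 2, and a scan would reassemble the row out of order.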
But yes, in a nutshell, it can be done.
Also, a side note: if you want to use Lucene as your secondary index, you could do it... but I haven't thought through that problem yet...
On a different side note:
This is why the current model of indexes may work OK for limiting results against a single table, but it won't work well when you want to do joins across tables.
(And you will want to do joins in HBase eventually....)
On Aug 19, 2013, at 9:11 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
> I think you should not try to join the tables this way. It will be against
> the recommended design/pattern of HBase (joins in HBase alone go against
> the design) and M/R. You should first, maybe through another M/R job or Pig
> script, for example, pre-process data and massage it into a uniform or
> appropriate structure conforming to the M/R architecture (maybe convert
> them into ext files first?) Have you looked into the recommended M/R join
> Some links to start with:
> On Mon, Aug 19, 2013 at 9:43 AM, Pavan Sudheendra <[EMAIL PROTECTED]>wrote:
>> I'm basically trying to do a join across 3 tables in the mapper.. In the
>> reducer i am doing a group by and writing the output to another table..
>> Although, i agree that my code is pathetic, what i could actually do is
>> create a HTable object once and pass it as an extra argument to the map
>> function.. But, would that solve the problem?
>> Roughly these are my tables and the code flows like this
>> Mapper -> Table1 -> Contentidx -> Content -> Mapper aggregates the values
>> Table1 - 19 million rows.
>> Contentidx table - 150k rows.
>> Content table - 93k rows.
>> Yes, I have looked at the map-reduce example on the HBase website and
>> that is what I am following.
>> On Mon, Aug 19, 2013 at 7:05 PM, Shahab Yunus <[EMAIL PROTECTED]
>>> Can you please explain or show the flow of the code a bit more? Why are
>>> you creating the HTable object again and again in the mapper? Where is
>>> (the name of the table, I believe?) defined? What is your actually
>>> Also, have you looked into this, the api for wiring HBase tables with M/R
The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
michael_segel (AT) hotmail.com