HBase, mail # user - How to design a data warehouse in HBase?


Re: How to design a data warehouse in HBase?
Asaf Mesika 2012-12-15, 02:14
Here's my take on this matter:

At the moment there isn't a good off-the-shelf solution for the kind of large-scale data warehousing you want. Impala and Drill are both projects that head in this direction, but they still have a way to go and are not production ready yet. If you can stay on MySQL for the moment, then stay there, or go with Hive, but be prepared to run a very large cluster of machines to handle the load.

A normal data warehouse as you describe is composed of dimension (DIM) and FACT tables. Representing this as-is in HBase is a mess, since it requires joins across the cluster - i.e. lots of RPC calls between Region Servers - which will grind your queries to a halt (unless you want your users to wait 10-15 minutes).

The saner approach is to denormalize the data - i.e. build one big, fat FACT table that also carries the attributes of all its dimensions - and save it to HDFS or HBase. Either way you choose a partition key, the primary key you query on (e.g. timestamp-customerId, timestamp-deviceId). A query first filters by the partition key, so only a portion of the data is scanned, and then each datanode/Region Server filters by the dimension attributes your query needs. If the data is distributed evenly across the cluster, running the query on many nodes in parallel offsets the cost of fully reading the files/rows belonging to that partition key. You can also apply the aggregate functions you need, such as sum and count, on the server side and send back only the rolled-up results, saving bandwidth.
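To make that concrete, here is a minimal sketch of that scan pattern with the (0.94-era) HBase Java client. The "fact" table name, the "d" (dimension attributes) and "m" (measures) column families, and the timestamp-customerId key format are my own assumptions for illustration, not anything from your schema:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FactTableScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "fact");           // hypothetical denormalized fact table

    // Partition-key prefix <date>-<customerId>: one customer's day of data
    // is a contiguous row range, so only that slice gets scanned.
    byte[] start = Bytes.toBytes("20121213-cust42");
    byte[] stop  = Bytes.toBytes("20121213-cust42~");  // crude prefix-end trick: '~' sorts after the key chars

    Scan scan = new Scan(start, stop);
    // Dimension filter evaluated server-side on each Region Server,
    // against an attribute stored inline in the fact row.
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("region"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("EMEA")));

    long sum = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        sum += Bytes.toLong(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("amount")));
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("sum(amount) = " + sum);
  }
}

Note the rollup here still happens on the client; HBase's AggregationClient coprocessor can push sum/count down to the Region Servers instead, which is closer to the "send back only the rolled-up results" idea above.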

The problem with the current software stack is that nothing actually does what is described above out of the box. Impala is in the right direction, but it's yet to reach production state, from what I've read, and Drill is just starting. So you end up writing MapReduce jobs that implement the solution described above, either by employing Hive to lay the HDFS files out by partition key and translate your query into an MR job, or by using other open-source tools such as Cascading to ease the burden of writing your own MR job code.
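For the hand-rolled MR route, this is roughly what such a job could look like, assuming the denormalized FACT table sits on HDFS as tab-separated lines of <timestamp, customerId, region, amount> laid out in one directory per day. All paths, column positions and the "EMEA" filter value are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumAmountByCustomer {

  public static class FilterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split("\t");
      // Dimension filter pushed into the mapper: only EMEA rows survive.
      if (cols.length == 4 && "EMEA".equals(cols[2])) {
        ctx.write(new Text(cols[1]), new LongWritable(Long.parseLong(cols[3])));
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sum-amount-by-customer");
    job.setJarByClass(SumAmountByCustomer.class);
    job.setMapperClass(FilterMapper.class);
    job.setCombinerClass(SumReducer.class);   // safe: sum is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // Partition pruning by hand: point the job only at the date directory you need.
    FileInputFormat.addInputPath(job, new Path("/warehouse/fact/dt=20121213"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/sum-by-customer"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything the query needs - partition pruning, dimension filtering, aggregation - has to be coded by hand this way, which is exactly the burden Hive or Cascading take off you.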

So in summary, I would stay on Oracle/MySQL until a decent open-source option that answers your need arrives - which I guess will happen during 2013/2014. If you can't, you will be forced to write your own custom solution, tailored to your queries, based on MapReduce jobs. You can take a look at Trecul (https://github.com/akamai-tech/trecul) to boost the processing speed of your MapReduce jobs.

Asaf
On 13 Dec 2012, at 07:57, bigdata <[EMAIL PROTECTED]> wrote:

> Dear all,
> We have a traditional star-model data warehouse in RDBMS, now we want to transfer it to HBase. After study HBase, I learn that HBase is normally can be query by rowkey.
> 1.full rowkey (fastest)2.rowkey filter (fast)3.column family/qualifier filter (slow)
> How can I design the HBase tables to implement the warehouse functions, like:1.Query by DimensionA2.Query by DimensionA and DimensionB3.Sum, count, distinct ...
> From my opinion, I should create several HBase tables with all combinations of different dimensions as the rowkey. This solution will lead to huge data duplication. Is there any good suggestions to solve it?
> Thanks a lot!
>
>