|
|
-
Implementing a star schema (facts & dimension model)
Austin Chungath 2012-10-22, 11:17
Hi,
I am new to data warehousing in hadoop. This might be a trivial question but I was unable to find any answers in the mailing list. My questions are: A person has an existing data warehouse that uses a star schema (implemented in a mysql database).How to migrate it to Hadoop? I can use sqoop to copy my tables to hive, that much I know.
But what happens to referential integrity? since there are no primary key / foreign key concepts. I have seen that I can use Hive & Hbase together. Is there a method for storing facts and dimension tables in hadoop using Hive & Hbase together? Does putting dimensions in Hbase & facts in Hive make any sense? or should it be the other way around?
Consider de-normalization is not an option. What is the best practice to port an existing data warehouse to hadoop, with minimum changes to the database model?
Please let me know with whatever views you have on this.
Thanks, Austin
-
Re: Implementing a star schema (facts & dimension model)
Bejoy KS 2012-10-22, 11:29
Hi Austin
You can import the existing tables to hive as such using sqoop. Hive is a wrapper over mapreduce that gives you the flexibility to create optimized mapreduce jobs using SQL like syntax. The is no relational style maintained in hive and don't treat hive as a typical database/datawarehouse. In short no referencial integrity. However if you have all the data in hive almost all queries that work on mysql would work in hive as well. Some queries may not but still you'll have work arounds.
You can have whole data sets/tables in hive itself. You don't need denormalization much I guess. Joins work well in hive.
Regards Bejoy KS
Sent from handheld, please excuse typos.
-----Original Message----- From: Austin Chungath <[EMAIL PROTECTED]> Date: Mon, 22 Oct 2012 16:47:04 To: <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: Implementing a star schema (facts & dimension model)
Hi,
I am new to data warehousing in hadoop. This might be a trivial question but I was unable to find any answers in the mailing list. My questions are: A person has an existing data warehouse that uses a star schema (implemented in a mysql database).How to migrate it to Hadoop? I can use sqoop to copy my tables to hive, that much I know.
But what happens to referential integrity? since there are no primary key / foreign key concepts. I have seen that I can use Hive & Hbase together. Is there a method for storing facts and dimension tables in hadoop using Hive & Hbase together? Does putting dimensions in Hbase & facts in Hive make any sense? or should it be the other way around?
Consider de-normalization is not an option. What is the best practice to port an existing data warehouse to hadoop, with minimum changes to the database model?
Please let me know with whatever views you have on this.
Thanks, Austin
-
Re: Implementing a star schema (facts & dimension model)
Manish Bhoge 2012-10-22, 11:34
Austin,
There are some of the great questions asked simply in your email.
Datawarehouse and hadoop echo system goes hand-on-hand. I don't think you need to move all data from your warehouse to hive and hbase. This is the key :) you need to understand where should you use have and where can you utilize hbase for your existing datawarehouse. There is no reason for which you can just move everything from current state of datawarehouse to hadoop. Anyway, I can tell you some of the learning here, if you have data which batch loaded and has huge volume(more than 1 TB) and you can flatten the structure to a table then use Hive it will give you flexibility to slide and dice the data between hive table. Remember, it should be batch updated or never updated from the source (dimensional data basically). However all the frequent transaction can be routed through Hbase. Hbase gives you better update mechanism than Hive. If you're working on web data(crawler or logs) then Pig can give you quick utilization to bring on that data to analysis or creating reporting dataset for further analysis.
Remember, the way you design the data structure or model in datawarehouse world is all together different than the way you define in hadoop echo system.
I would iterate first find out what data you need to move to hadoop echo system on what reason? Believe me MySql can give you far quicker response . Don't use hadoop to replace your MySQL datawarehouse.
Thank You, Manish.
Sent from my BlackBerry, pls excuse typo
-----Original Message----- From: Austin Chungath <[EMAIL PROTECTED]> Date: Mon, 22 Oct 2012 16:47:04 To: <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: Implementing a star schema (facts & dimension model)
Hi,
I am new to data warehousing in hadoop. This might be a trivial question but I was unable to find any answers in the mailing list. My questions are: A person has an existing data warehouse that uses a star schema (implemented in a mysql database).How to migrate it to Hadoop? I can use sqoop to copy my tables to hive, that much I know.
But what happens to referential integrity? since there are no primary key / foreign key concepts. I have seen that I can use Hive & Hbase together. Is there a method for storing facts and dimension tables in hadoop using Hive & Hbase together? Does putting dimensions in Hbase & facts in Hive make any sense? or should it be the other way around?
Consider de-normalization is not an option. What is the best practice to port an existing data warehouse to hadoop, with minimum changes to the database model?
Please let me know with whatever views you have on this.
Thanks, Austin
-
Re: Implementing a star schema (facts & dimension model)
Shreepadma Venugopalan 2012-10-22, 18:01
Hi Austin,
Bejoy summarized it well. Even though Hive doesn't have referencial integrity constraints, if you are loading data from MySQL that has referencial integrity constraint, Hive preserves that property of the data. Another thing to note is you can't update the data that is loaded into a Hive warehouse. So once you migrate the data from MySQL, the referencial integrity constraints will be preserved in the Hive warehouse. Furthermore, the fact that Hive doesn't have referencial integrity constraints shouldn't prevent you from running any of the queries that you run on MySQL . However, some joins that are typically eliminated in an RDBMS because of referencial integrity constraints are not eliminated.
Thanks, Shreepadma
On Mon, Oct 22, 2012 at 4:29 AM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> ** > Hi Austin > > You can import the existing tables to hive as such using sqoop. Hive is a > wrapper over mapreduce that gives you the flexibility to create optimized > mapreduce jobs using SQL like syntax. The is no relational style maintained > in hive and don't treat hive as a typical database/datawarehouse. In short > no referencial integrity. > > > However if you have all the data in hive almost all queries that work on > mysql would work in hive as well. Some queries may not but still you'll > have work arounds. > > You can have whole data sets/tables in hive itself. > > You don't need denormalization much I guess. Joins work well in hive. > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > ------------------------------ > *From: * Austin Chungath <[EMAIL PROTECTED]> > *Date: *Mon, 22 Oct 2012 16:47:04 +0530 > *To: *<[EMAIL PROTECTED]> > *ReplyTo: * [EMAIL PROTECTED] > *Subject: *Implementing a star schema (facts & dimension model) > > Hi, > > I am new to data warehousing in hadoop. This might be a trivial question > but I was unable to find any answers in the mailing list. > My questions are: > A person has an existing data warehouse that uses a star schema > (implemented in a mysql database).How to migrate it to Hadoop? > I can use sqoop to copy my tables to hive, that much I know. > > But what happens to referential integrity? since there are no primary key > / foreign key concepts. > I have seen that I can use Hive & Hbase together. Is there a method for > storing facts and dimension tables in hadoop using Hive & Hbase together? > Does putting dimensions in Hbase & facts in Hive make any sense? or should > it be the other way around? > > Consider de-normalization is not an option. > What is the best practice to port an existing data warehouse to hadoop, > with minimum changes to the database model? > > Please let me know with whatever views you have on this. > > Thanks, > Austin > >
|
|