|
|
-
Dimensional Data Model on Hive
Kuldeep Chitrakar 2012-05-10, 13:26
Hi
I have data warehouse implementation for Click Stream data analysis on RDBMS. Its a start schema (Dimensions and Facts).
Now if i want to move to Hive, Do i need to create same data model as Dimensions and facts and join them.
I should create a big de-normalized table which contains all textual attributes from all dimensions. If so how do we handle SCD 2 type dimensions in Hive.
Its very basic question but I am just confused on this. Thanks, Kuldeep
-
Re: Dimensional Data Model on Hive
Edward Capriolo 2012-05-10, 13:53
On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar <[EMAIL PROTECTED]> wrote: > Hi > > > > I have data warehouse implementation for Click Stream data analysis on > RDBMS. Its a start schema (Dimensions and Facts). > > > > Now if i want to move to Hive, Do i need to create same data model as > Dimensions and facts and join them. > > > > I should create a big de-normalized table which contains all textual > attributes from all dimensions. If so how do we handle SCD 2 type dimensions > in Hive. > > > > Its very basic question but I am just confused on this. > > > > > > Thanks, > > Kuldeep
While hive is sometimes referred to as a data warehouse you usually want to avoid data warehouse concepts like stat-schema. There are a number of reasons for this: 1) No unique constraints 2) limited index capabilities 3) Map side joins are optimal when a single table is small 4) Most join types while generalize into map reduce are much different then a join in single node databases
I'm most situations I advice going the "nosql route" and de-normalize almost everything. Optimize for scanning.
-
RE: Dimensional Data Model on Hive
Kuldeep Chitrakar 2012-05-10, 14:16
Does that mean all data in one BigTable in de-normalized form? Then whats the main benefit of using Hive against Hbase as Hbase also recommends Highly de normalized BigTable. Thanks, Kuldeep -----Original Message----- From: Edward Capriolo [mailto:[EMAIL PROTECTED]] Sent: 10 May 2012 19:24 To: [EMAIL PROTECTED] Subject: Re: Dimensional Data Model on Hive
On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar <[EMAIL PROTECTED]> wrote: > Hi > > > > I have data warehouse implementation for Click Stream data analysis on > RDBMS. Its a start schema (Dimensions and Facts). > > > > Now if i want to move to Hive, Do i need to create same data model as > Dimensions and facts and join them. > > > > I should create a big de-normalized table which contains all textual > attributes from all dimensions. If so how do we handle SCD 2 type dimensions > in Hive. > > > > Its very basic question but I am just confused on this. > > > > > > Thanks, > > Kuldeep
While hive is sometimes referred to as a data warehouse you usually want to avoid data warehouse concepts like stat-schema. There are a number of reasons for this: 1) No unique constraints 2) limited index capabilities 3) Map side joins are optimal when a single table is small 4) Most join types while generalize into map reduce are much different then a join in single node databases
I'm most situations I advice going the "nosql route" and de-normalize almost everything. Optimize for scanning.
-
Re: Dimensional Data Model on Hive
Edward Capriolo 2012-05-10, 14:24
On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar <[EMAIL PROTECTED]> wrote: > Does that mean all data in one BigTable in de-normalized form? Then whats the main benefit of using Hive against Hbase as Hbase also recommends Highly de normalized BigTable. > > > Thanks, > Kuldeep > -----Original Message----- > From: Edward Capriolo [mailto:[EMAIL PROTECTED]] > Sent: 10 May 2012 19:24 > To: [EMAIL PROTECTED] > Subject: Re: Dimensional Data Model on Hive > > On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar > <[EMAIL PROTECTED]> wrote: >> Hi >> >> >> >> I have data warehouse implementation for Click Stream data analysis on >> RDBMS. Its a start schema (Dimensions and Facts). >> >> >> >> Now if i want to move to Hive, Do i need to create same data model as >> Dimensions and facts and join them. >> >> >> >> I should create a big de-normalized table which contains all textual >> attributes from all dimensions. If so how do we handle SCD 2 type dimensions >> in Hive. >> >> >> >> Its very basic question but I am just confused on this. >> >> >> >> >> >> Thanks, >> >> Kuldeep > > While hive is sometimes referred to as a data warehouse you usually > want to avoid data warehouse concepts like stat-schema. There are a > number of reasons for this: > 1) No unique constraints > 2) limited index capabilities > 3) Map side joins are optimal when a single table is small > 4) Most join types while generalize into map reduce are much different > then a join in single node databases > > I'm most situations I advice going the "nosql route" and de-normalize > almost everything. Optimize for scanning.
Q: Does that mean all data in one BigTable in de-normalized form? A: No. I qualified this by saying "most". I am not advocating one large table, every situation is different. But generally star schema is going to be very difficult to implement and have less benefits then it would in most RDBMS systems.
Q: What is the main benefit of using hive against hbase? A: I am not sure what you mean by "against". If you mean why would i chose one and not the other, hbase is designed for low latency < 20 ms put, get and scan operations. Hive is a declarative SQL like language that "queries" multi GB or TB sized files in hadoop. There is a storage handler implementation that allows you to query hbase data from hive as well if that is what you mean by against.
-
Re: Dimensional Data Model on Hive
Ashish Thusoo 2012-05-10, 15:16
Also of most of the things that you will be doing is full scans as opposed to needle in haystack queries there is usually no point in paying the overhead of running hbase region servers. Only if your data is heavily accessed by a key is the overhead of hbase justified. Another case could be when parts of your data are updated heavily again by a predominant key.
Ashish On May 10, 2012 10:25 AM, "Edward Capriolo" <[EMAIL PROTECTED]> wrote:
> On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar > <[EMAIL PROTECTED]> wrote: > > Does that mean all data in one BigTable in de-normalized form? Then > whats the main benefit of using Hive against Hbase as Hbase also recommends > Highly de normalized BigTable. > > > > > > Thanks, > > Kuldeep > > -----Original Message----- > > From: Edward Capriolo [mailto:[EMAIL PROTECTED]] > > Sent: 10 May 2012 19:24 > > To: [EMAIL PROTECTED] > > Subject: Re: Dimensional Data Model on Hive > > > > On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar > > <[EMAIL PROTECTED]> wrote: > >> Hi > >> > >> > >> > >> I have data warehouse implementation for Click Stream data analysis on > >> RDBMS. Its a start schema (Dimensions and Facts). > >> > >> > >> > >> Now if i want to move to Hive, Do i need to create same data model as > >> Dimensions and facts and join them. > >> > >> > >> > >> I should create a big de-normalized table which contains all textual > >> attributes from all dimensions. If so how do we handle SCD 2 type > dimensions > >> in Hive. > >> > >> > >> > >> Its very basic question but I am just confused on this. > >> > >> > >> > >> > >> > >> Thanks, > >> > >> Kuldeep > > > > While hive is sometimes referred to as a data warehouse you usually > > want to avoid data warehouse concepts like stat-schema. There are a > > number of reasons for this: > > 1) No unique constraints > > 2) limited index capabilities > > 3) Map side joins are optimal when a single table is small > > 4) Most join types while generalize into map reduce are much different > > then a join in single node databases > > > > I'm most situations I advice going the "nosql route" and de-normalize > > almost everything. Optimize for scanning. > > Q: Does that mean all data in one BigTable in de-normalized form? > A: No. I qualified this by saying "most". I am not advocating one > large table, every situation is different. But generally star schema > is going to be very difficult to implement and have less benefits then > it would in most RDBMS systems. > > Q: What is the main benefit of using hive against hbase? > A: I am not sure what you mean by "against". If you mean why would i > chose one and not the other, hbase is designed for low latency < 20 ms > put, get and scan operations. Hive is a declarative SQL like language > that "queries" multi GB or TB sized files in hadoop. There is a > storage handler implementation that allows you to query hbase data > from hive as well if that is what you mean by against. >
-
Re: Dimensional Data Model on Hive
Jagat 2012-05-10, 15:17
Hello
Try to keep set of records which you need for particular analysis in same table. Generally we use Pig to feed data to hive tables and we have arranged our tables such that all the data which is to required for particular report is right present in that table. This helps to improve hive performance. While designing your schema Partition , index your tables depending on your queries.
Fact and dimensions concept should not be taken too seriously here. On Thu, May 10, 2012 at 6:56 PM, Kuldeep Chitrakar < [EMAIL PROTECTED]> wrote:
> Hi **** > > ** ** > > I have data warehouse implementation for Click Stream data analysis on > RDBMS. Its a start schema (Dimensions and Facts).**** > > ** ** > > Now if i want to move to Hive, Do i need to create same data model as > Dimensions and facts and join them. **** > > ** ** > > I should create a big de-normalized table which contains all textual > attributes from all dimensions. If so how do we handle SCD 2 type > dimensions in Hive.**** > > ** ** > > Its very basic question but I am just confused on this.**** > > ** ** > > ** ** > > Thanks,**** > > Kuldeep**** >
-
Re: Dimensional Data Model on Hive
Justin Coffey 2012-05-10, 19:29
Hello, My thoughts are rather straightforward: it is best not to think of hive as a data warehouse at all. period.
It is better to think of it as SQL to MapReduce translation layer with some meta data to help guide the process.
With this in mind, and if you really have lots of data, what you want to do is denormalize everything to avoid any and all joins (even map side joins if they have to happen on each record are costly).
Remember, you're not (really) querying indexed data (kinda not true, but mostly valid). You're querying distributed log files.
-Justin
On Thu, May 10, 2012 at 5:17 PM, Jagat <[EMAIL PROTECTED]> wrote:
> Hello > > Try to keep set of records which you need for particular analysis in same > table. Generally we use Pig to feed data to hive tables and we have > arranged our tables such that all the data which is to required for > particular report is right present in that table. This helps to improve > hive performance. While designing your schema Partition , index your tables > depending on your queries. > > Fact and dimensions concept should not be taken too seriously here. > > > > On Thu, May 10, 2012 at 6:56 PM, Kuldeep Chitrakar < > [EMAIL PROTECTED]> wrote: > >> Hi **** >> >> ** ** >> >> I have data warehouse implementation for Click Stream data analysis on >> RDBMS. Its a start schema (Dimensions and Facts).**** >> >> ** ** >> >> Now if i want to move to Hive, Do i need to create same data model as >> Dimensions and facts and join them. **** >> >> ** ** >> >> I should create a big de-normalized table which contains all textual >> attributes from all dimensions. If so how do we handle SCD 2 type >> dimensions in Hive.**** >> >> ** ** >> >> Its very basic question but I am just confused on this.**** >> >> ** ** >> >> ** ** >> >> Thanks,**** >> >> Kuldeep**** >> > > -- [EMAIL PROTECTED] -----
|
|