Hive, mail # user - Dimensional Data Model on Hive


Re: Dimensional Data Model on Hive
Ashish Thusoo 2012-05-10, 15:16
Also, if most of what you will be doing is full scans as opposed
to needle-in-a-haystack queries, there is usually no point in paying the
overhead of running HBase region servers. The overhead of HBase is justified
only if your data is heavily accessed by key. Another case could be
when parts of your data are updated heavily, again by a predominant key.

Ashish
On May 10, 2012 10:25 AM, "Edward Capriolo" <[EMAIL PROTECTED]> wrote:

> On Thu, May 10, 2012 at 10:16 AM, Kuldeep Chitrakar
> <[EMAIL PROTECTED]> wrote:
> > Does that mean all data in one BigTable in de-normalized form? Then
> > what's the main benefit of using Hive against HBase, as HBase also
> > recommends a highly de-normalized BigTable?
> >
> >
> > Thanks,
> > Kuldeep
> > -----Original Message-----
> > From: Edward Capriolo [mailto:[EMAIL PROTECTED]]
> > Sent: 10 May 2012 19:24
> > To: [EMAIL PROTECTED]
> > Subject: Re: Dimensional Data Model on Hive
> >
> > On Thu, May 10, 2012 at 9:26 AM, Kuldeep Chitrakar
> > <[EMAIL PROTECTED]> wrote:
> >> Hi
> >>
> >>
> >>
> >> I have a data warehouse implementation for click stream data analysis on
> >> an RDBMS. It's a star schema (dimensions and facts).
> >>
> >>
> >>
> >> Now if I want to move to Hive, do I need to create the same data model,
> >> with dimensions and facts, and join them?
> >>
> >>
> >>
> >> Or should I create a big de-normalized table which contains all textual
> >> attributes from all dimensions? If so, how do we handle SCD type 2
> >> dimensions in Hive?
> >>
> >>
> >>
> >> It's a very basic question, but I am just confused about this.
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Kuldeep
> >
> > While Hive is sometimes referred to as a data warehouse, you usually
> > want to avoid data warehouse concepts like star schema. There are a
> > number of reasons for this:
> > 1) No unique constraints
> > 2) Limited index capabilities
> > 3) Map-side joins are optimal when one of the tables is small
> > 4) Most join types, while they generalize into map reduce, are much
> > different than a join in a single-node database
> >
> > In most situations I advise going the "nosql route" and de-normalizing
> > almost everything. Optimize for scanning.
>
> Q: Does that mean all data in one BigTable in de-normalized form?
> A: No. I qualified this by saying "most". I am not advocating one
> large table; every situation is different. But generally a star schema
> is going to be very difficult to implement and will have fewer benefits
> than it would in most RDBMS systems.
>
> Q: What is the main benefit of using Hive against HBase?
> A: I am not sure what you mean by "against". If you mean why would I
> choose one and not the other, HBase is designed for low-latency (< 20 ms)
> put, get and scan operations. Hive is a declarative SQL-like language
> that "queries" multi-GB or TB-sized files in Hadoop. There is a
> storage handler implementation that allows you to query HBase data
> from Hive as well, if that is what you mean by "against".
>
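
The point about map-side joins (reason 3 in the list above) can be
illustrated with a minimal sketch. This is not Hive code; it is plain
Python that mimics the idea under stated assumptions: the small dimension
table fits in memory and is loaded into a hash map, while the large fact
table is streamed once, with no shuffle or sort. The table contents, the
column names (page_id, page_name, user) and the function name
map_side_join are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def map_side_join(
    facts: Iterable[Dict], small_dim: Iterable[Dict], key: str
) -> Iterator[Dict]:
    """Inner-join streamed fact rows against a small in-memory dimension."""
    # Build phase: the small table is held fully in memory, keyed by join key.
    lookup = {row[key]: row for row in small_dim}
    # Probe phase: each fact row is read once and matched with a hash lookup.
    for fact in facts:
        match = lookup.get(fact[key])
        if match is not None:  # rows without a matching dimension are dropped
            # Emit one de-normalized row carrying the dimension attributes.
            yield {**match, **fact}


# Hypothetical click stream data: a small page dimension and a fact stream.
pages = [
    {"page_id": 1, "page_name": "home"},
    {"page_id": 2, "page_name": "checkout"},
]
clicks = [
    {"page_id": 1, "user": "u1"},
    {"page_id": 2, "user": "u2"},
    {"page_id": 9, "user": "u3"},  # no such page in the dimension
]

denormalized = list(map_side_join(clicks, pages, "page_id"))
```

The sketch only works because the dimension is small enough to hold in
memory; when both tables are large, this build-and-probe approach no
longer applies, which is one reason a de-normalized, scan-friendly layout
is suggested in the thread.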