Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Getting dimension values for Facts


Copy link to this message
-
Re: Getting dimension values for Facts
It seems like your fact table and its corresponding dimension tables follow
a traditional data warehousing star topology relational diagram.  I would
have to ask what is the purpose/deliverable of coalescing your star
topology into a 1NF of fields in Pig?  In this case, I'd agree with
Pradeep; that the use of MR for queries against the warehouse using Hive
might be more appropriate.  If you really do need the coalesced 1NF of your
warehouse, I'd recommend a parent bash/perl/php script to dynamically
generate a pig script, with your table names and paths as parameter inputs
to the script.

-Dan
On Fri, Jul 19, 2013 at 10:10 AM, Sajid Raza <[EMAIL PROTECTED]> wrote:

> Could you potentially store all of your facts in one table, join it agains
> your dimension table, and filter as needed?
>
> On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[EMAIL PROTECTED]>
> wrote:
>
> > Unfortunately I can't think of any good way of doing this (other than
> what
> > Bertrand suggested with using a different language to generate the
> script).
> >
> > I'd also recommend Hive... it may be easier to do this in Hive since you
> > have SQL like syntax. (Haven't used Hive, but it looks like this type of
> > thing would be far more natural in Hive)
> >
> >
> > On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> > [EMAIL PROTECTED]> wrote:
> >
> >> I don't think this is macro-able, Pradeep.  Every step of the way a
> >> different column gets updated.  For example, for FACT_TABLE3 we update
> >> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from
> DIMENSION2 &
> >> so on.
> >>
> >> Feel free to correct me if I am wrong.  Thanks.
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <
> [EMAIL PROTECTED]
> >>> wrote:
> >>
> >>> Looks like this might be macroable. Not entirely sure how that can be
> >> done
> >>> yet... but I'd look into that if I were you.
> >>>
> >>>
> >>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
> >>> [EMAIL PROTECTED]> wrote:
> >>>
> >>>> Wow, Bertrand, on the Pig mailing list you're recommending not to use
> >>>> Pig... LOL!  Jokes apart, I would think this would be a common use
> case
> >>> for
> >>>> Pig, no?  Generating a Pig script on the fly is a decent idea, but
> >> we're
> >>>> hoping to avoid that - unless there's no other way.  Thanks for the
> >>>> pointers.
> >>>>
> >>>>
> >>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[EMAIL PROTECTED]
> >>>>> wrote:
> >>>>
> >>>>> I would say either generate the script using another language (eg
> >>> Python)
> >>>>> or use a true programming language with an API having the same level
> >> of
> >>>>> abstraction (eg Java and Cascading).
> >>>>>
> >>>>> Bertrand
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> >>>>> [EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> There must be a better way to do this in Pig.  Here's how my script
> >>>> looks
> >>>>>> like right now:  (omitted some snippet for saving space, but you
> >> will
> >>>> get
> >>>>>> the idea).
> >>>>>>
> >>>>>> FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
> >> chararray);
> >>>>>>
> >>>>>> FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
> >>> col2,…..
> >>>>>> udf10(col30) as col30;
> >>>>>>
> >>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
> >>>>>>
> >>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
> >> key;
> >>>>>>
> >>>>>> FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
> >>>> col1,…….
> >>>>>> FACT_TABLE1::col30 as col30;
> >>>>>>
> >>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
> >>>>>>
> >>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
> >> key;
> >>>>>>
> >>>>>> FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> >>>>>> col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> >>>>>>
> >>>>>> & so on!  There are 10 more such dimension tables to join.
> >>>>>>
> >>>>>> In short, each row on the fact table needs to be joined to a key