Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Getting dimension values for Facts


+
Something Something 2013-07-18, 06:44
+
Bertrand Dechoux 2013-07-18, 09:52
+
Something Something 2013-07-18, 15:16
+
Pradeep Gollakota 2013-07-18, 15:25
+
Something Something 2013-07-18, 16:09
+
Pradeep Gollakota 2013-07-18, 17:51
+
Sajid Raza 2013-07-19, 14:10
Copy link to this message
-
Re: Getting dimension values for Facts
It seems like your fact table and its corresponding dimension tables follow
a traditional data warehousing star topology relational diagram.  I would
have to ask what is the purpose/deliverable of coalescing your star
topology into a 1NF of fields in Pig?  In this case, I'd agree with
Pradeep; that the use of MR for queries against the warehouse using Hive
might be more appropriate.  If you really do need the coalesced 1NF of your
warehouse, I'd recommend a parent bash/perl/php script to dynamically
generate a pig script, with your table names and paths as parameter inputs
to the script.

-Dan
On Fri, Jul 19, 2013 at 10:10 AM, Sajid Raza <[EMAIL PROTECTED]> wrote:

> Could you potentially store all of your facts in one table, join it agains
> your dimension table, and filter as needed?
>
> On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[EMAIL PROTECTED]>
> wrote:
>
> > Unfortunately I can't think of any good way of doing this (other than
> what
> > Bertrand suggested with using a different language to generate the
> script).
> >
> > I'd also recommend Hive... it may be easier to do this in Hive since you
> > have SQL like syntax. (Haven't used Hive, but it looks like this type of
> > thing would be far more natural in Hive)
> >
> >
> > On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> > [EMAIL PROTECTED]> wrote:
> >
> >> I don't think this is macro-able, Pradeep.  Every step of the way a
> >> different column gets updated.  For example, for FACT_TABLE3 we update
> >> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from
> DIMENSION2 &
> >> so on.
> >>
> >> Feel free to correct me if I am wrong.  Thanks.
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <
> [EMAIL PROTECTED]
> >>> wrote:
> >>
> >>> Looks like this might be macroable. Not entirely sure how that can be
> >> done
> >>> yet... but I'd look into that if I were you.
> >>>
> >>>
> >>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
> >>> [EMAIL PROTECTED]> wrote:
> >>>
> >>>> Wow, Bertrand, on the Pig mailing list you're recommending not to use
> >>>> Pig... LOL!  Jokes apart, I would think this would be a common use
> case
> >>> for
> >>>> Pig, no?  Generating a Pig script on the fly is a decent idea, but
> >> we're
> >>>> hoping to avoid that - unless there's no other way.  Thanks for the
> >>>> pointers.
> >>>>
> >>>>
> >>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[EMAIL PROTECTED]
> >>>>> wrote:
> >>>>
> >>>>> I would say either generate the script using another language (eg
> >>> Python)
> >>>>> or use a true programming language with an API having the same level
> >> of
> >>>>> abstraction (eg Java and Cascading).
> >>>>>
> >>>>> Bertrand
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> >>>>> [EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> There must be a better way to do this in Pig.  Here's how my script
> >>>> looks
> >>>>>> like right now:  (omitted some snippet for saving space, but you
> >> will
> >>>> get
> >>>>>> the idea).
> >>>>>>
> >>>>>> FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
> >> chararray);
> >>>>>>
> >>>>>> FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
> >>> col2,…..
> >>>>>> udf10(col30) as col30;
> >>>>>>
> >>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
> >>>>>>
> >>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
> >> key;
> >>>>>>
> >>>>>> FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
> >>>> col1,…….
> >>>>>> FACT_TABLE1::col30 as col30;
> >>>>>>
> >>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
> >>>>>>
> >>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
> >> key;
> >>>>>>
> >>>>>> FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> >>>>>> col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> >>>>>>
> >>>>>> & so on!  There are 10 more such dimension tables to join.
> >>>>>>
> >>>>>> In short, each row on the fact table needs to be joined to a key
+
Something Something 2013-07-19, 23:33
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB