Re: Getting dimension values for Facts
Unfortunately I can't think of any good way of doing this (other than what
Bertrand suggested: using a different language to generate the script).

I'd also recommend Hive. It may be easier to do this there since you
have SQL-like syntax. (I haven't used Hive, but it looks like this type of
thing would be far more natural in it.)
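
For illustration, the script-generation route Bertrand suggested could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function name `generate_pig_script`, the relation names (`FACT_n`, `JOINED_n`, `DIM_n`), and the column-to-dimension mapping are hypothetical, and the UDF step from the original script is left out because its details are elided in the thread.

```python
def generate_pig_script(fact_path, columns, dimensions):
    """Build a Pig Latin script that swaps each mapped column for its dimension value.

    fact_path  -- path of the fact table to LOAD
    columns    -- ordered list of fact column names
    dimensions -- ordered list of (column_name, dimension_path) pairs;
                  each pair adds one LEFT OUTER join and one projection
    All relation names below are hypothetical, chosen for this sketch.
    """
    schema = ", ".join(f"{c}:chararray" for c in columns)
    lines = [f"FACT_0 = LOAD '{fact_path}' AS ({schema});"]

    for i, (column, dim_path) in enumerate(dimensions, start=1):
        prev, joined, dim = f"FACT_{i - 1}", f"JOINED_{i}", f"DIM_{i}"
        lines.append(f"{dim} = LOAD '{dim_path}' AS (key, value);")
        lines.append(
            f"{joined} = JOIN {prev} BY {column} LEFT OUTER, {dim} BY key;"
        )
        # Keep every column, but take the joined column from the dimension.
        projection = ", ".join(
            f"{dim}::value AS {c}" if c == column else f"{prev}::{c} AS {c}"
            for c in columns
        )
        lines.append(f"FACT_{i} = FOREACH {joined} GENERATE {projection};")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    cols = [f"col{n}" for n in range(1, 31)]     # col1 .. col30, as in the thread
    dims = [("col1", "DIM1"), ("col2", "DIM2")]  # extend with further dimensions
    print(generate_pig_script("XYZ", cols, dims))
```

With this shape, adding a dimension table or a fact column means editing one list rather than rewriting every FOREACH, which is the maintenance problem raised in the original question. A STORE statement would still have to be appended to the generated script.
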
On Thu, Jul 18, 2013 at 12:09 PM, Something Something <[EMAIL PROTECTED]> wrote:

> I don't think this is macro-able, Pradeep.  At every step a different
> column gets updated.  For example, for FACT_TABLE3 we update 'col1' from
> DIMENSION1, for FACT_TABLE5 we update 'col2' from DIMENSION2, and so on.
>
> Feel free to correct me if I am wrong.  Thanks.
>
> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
> > Looks like this might be macroable. Not entirely sure how that can be
> > done yet... but I'd look into that if I were you.
> >
> >
> > On Thu, Jul 18, 2013 at 11:16 AM, Something Something <[EMAIL PROTECTED]> wrote:
> >
> > > Wow, Bertrand, on the Pig mailing list you're recommending not to use
> > > Pig... LOL!  Jokes aside, I would think this would be a common use case
> > > for Pig, no?  Generating a Pig script on the fly is a decent idea, but
> > > we're hoping to avoid that unless there's no other way.  Thanks for the
> > > pointers.
> > >
> > >
> > > On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> > >
> > > > I would say either generate the script using another language (e.g.
> > > > Python) or use a true programming language with an API having the same
> > > > level of abstraction (e.g. Java and Cascading).
> > > >
> > > > Bertrand
> > > >
> > > >
> > > > On Thu, Jul 18, 2013 at 8:44 AM, Something Something <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > There must be a better way to do this in Pig.  Here's what my script
> > > > > looks like right now (I omitted some snippets to save space, but you
> > > > > will get the idea):
> > > > >
> > > > > FACT_TABLE = LOAD 'XYZ' AS (col1:chararray, ………. col30:chararray);
> > > > >
> > > > > FACT_TABLE1 = FOREACH FACT_TABLE GENERATE col1, udf1(col2) AS col2,
> > > > >     ….. udf10(col30) AS col30;
> > > > >
> > > > > DIMENSION1 = LOAD 'DIM1' AS (key, value);
> > > > >
> > > > > FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY key;
> > > > >
> > > > > FACT_TABLE3 = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value AS col1,
> > > > >     ……. FACT_TABLE1::col30 AS col30;
> > > > >
> > > > > DIMENSION2 = LOAD 'DIM2' AS (key, value);
> > > > >
> > > > > FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY key;
> > > > >
> > > > > FACT_TABLE5 = FOREACH FACT_TABLE4 GENERATE FACT_TABLE3::col1 AS col1,
> > > > >     DIMENSION2::value AS col2, ……. FACT_TABLE3::col30 AS col30;
> > > > >
> > > > > And so on!  There are 10 more such dimension tables to join.
> > > > >
> > > > > In short, each row of the fact table needs to be joined to a key
> > > > > field on a dimension table to get its associated value.
> > > > >
> > > > > This is beginning to look ugly.  Plus, it's a maintenance nightmare
> > > > > when it comes to adding new fields.  What's the best way to code this
> > > > > in Pig?
> > > > >
> > > > > Thanks in advance.