Pig >> mail # user >> Getting dimension values for Facts


Re: Getting dimension values for Facts
This looks like something that could be factored into a Pig macro. I'm not
entirely sure how yet, but I'd look into that if I were you.
On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
[EMAIL PROTECTED]> wrote:

> Wow, Bertrand, on the Pig mailing list you're recommending not to use
> Pig... LOL!  Jokes apart, I would think this would be a common use case for
> Pig, no?  Generating a Pig script on the fly is a decent idea, but we're
> hoping to avoid that - unless there's no other way.  Thanks for the
> pointers.
>
>
> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>
> > I would say either generate the script using another language (eg Python)
> > or use a true programming language with an API having the same level of
> > abstraction (eg Java and Cascading).
> >
> > Bertrand
> >
> >
> > On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> > [EMAIL PROTECTED]> wrote:
> >
> > > > There must be a better way to do this in Pig.  Here's what my script
> > > > looks like right now (I've omitted some snippets to save space, but
> > > > you'll get the idea):
> > >
> > > FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30: chararray);
> > >
> > > FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as col2,…..
> > > udf10(col30) as col30;
> > >
> > > DIMENSION1 = LOAD 'DIM1' as (key, value);
> > >
> > > FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY key;
> > >
> > > > FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as col1,…….
> > > >  FACT_TABLE1::col30 as col30;
> > >
> > > DIMENSION2 = LOAD 'DIM2' as (key, value);
> > >
> > > FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY key;
> > >
> > > FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> > > col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> > >
> > > & so on!  There are 10 more such dimension tables to join.
> > >
> > > > In short, each row in the fact table needs to be joined to a key field
> > > > on a dimension table to get its associated value.
> > >
> > > > This is beginning to look ugly.  Plus, it's a maintenance nightmare when
> > > > it comes to adding new fields.  What's the best way to code this in Pig?
> > >
> > > Thanks in advance.
> > >
> >
>
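
For what it's worth, Bertrand's suggestion of generating the script from another language could look roughly like the Python sketch below. It is only an illustration under some assumptions: the function name `generate_pig_script`, the alias names (`FACT_n`, `DIM_n`, `JOINED_n`), and the three-column example are all made up here, every dimension table is assumed to have the `(key, value)` layout from the original script, and the UDF step is left out for brevity.

```python
def generate_pig_script(fact_path, columns, dimensions):
    """Build a Pig Latin script that replaces fact columns with dimension values.

    fact_path  -- path of the fact table (hypothetical example: 'XYZ')
    columns    -- ordered list of fact column names
    dimensions -- dict mapping a fact column name to its dimension table path;
                  each dimension table is assumed to be laid out as (key, value)
    """
    schema = ", ".join(f"{c}: chararray" for c in columns)
    lines = [f"FACT_0 = LOAD '{fact_path}' AS ({schema});"]
    current = "FACT_0"

    for i, (col, dim_path) in enumerate(dimensions.items(), start=1):
        dim, joined, nxt = f"DIM_{i}", f"JOINED_{i}", f"FACT_{i}"
        lines.append(f"{dim} = LOAD '{dim_path}' AS (key, value);")
        lines.append(
            f"{joined} = JOIN {current} BY {col} LEFT OUTER, {dim} BY key;"
        )
        # Keep every column, but take the looked-up value for the joined one.
        projection = ", ".join(
            f"{dim}::value AS {c}" if c == col else f"{current}::{c} AS {c}"
            for c in columns
        )
        lines.append(f"{nxt} = FOREACH {joined} GENERATE {projection};")
        current = nxt

    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-column fact table with two dimension lookups.
    print(generate_pig_script(
        "XYZ",
        ["col1", "col2", "col3"],
        {"col1": "DIM1", "col2": "DIM2"},
    ))
```

With this approach, adding a new field or a new dimension means editing the column list or the mapping rather than every FOREACH in the chain. The trade-off is the one already mentioned in the thread: the Pig script becomes a generated artifact instead of something maintained by hand.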