Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Getting dimension values for Facts


Copy link to this message
-
Re: Getting dimension values for Facts
Sajid Raza 2013-07-19, 14:10
Could you potentially store all of your facts in one table, join it agains your dimension table, and filter as needed?

On Jul 18, 2013, at 10:51 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> Unfortunately I can't think of any good way of doing this (other than what
> Bertrand suggested with using a different language to generate the script).
>
> I'd also recommend Hive... it may be easier to do this in Hive since you
> have SQL like syntax. (Haven't used Hive, but it looks like this type of
> thing would be far more natural in Hive)
>
>
> On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> [EMAIL PROTECTED]> wrote:
>
>> I don't think this is macro-able, Pradeep.  Every step of the way a
>> different column gets updated.  For example, for FACT_TABLE3 we update
>> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from DIMENSION2 &
>> so on.
>>
>> Feel free to correct me if I am wrong.  Thanks.
>>
>>
>>
>>
>>
>> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota <[EMAIL PROTECTED]
>>> wrote:
>>
>>> Looks like this might be macroable. Not entirely sure how that can be
>> done
>>> yet... but I'd look into that if I were you.
>>>
>>>
>>> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Wow, Bertrand, on the Pig mailing list you're recommending not to use
>>>> Pig... LOL!  Jokes apart, I would think this would be a common use case
>>> for
>>>> Pig, no?  Generating a Pig script on the fly is a decent idea, but
>> we're
>>>> hoping to avoid that - unless there's no other way.  Thanks for the
>>>> pointers.
>>>>
>>>>
>>>> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux <[EMAIL PROTECTED]
>>>>> wrote:
>>>>
>>>>> I would say either generate the script using another language (eg
>>> Python)
>>>>> or use a true programming language with an API having the same level
>> of
>>>>> abstraction (eg Java and Cascading).
>>>>>
>>>>> Bertrand
>>>>>
>>>>>
>>>>> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> There must be a better way to do this in Pig.  Here's how my script
>>>> looks
>>>>>> like right now:  (omitted some snippet for saving space, but you
>> will
>>>> get
>>>>>> the idea).
>>>>>>
>>>>>> FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
>> chararray);
>>>>>>
>>>>>> FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
>>> col2,…..
>>>>>> udf10(col30) as col30;
>>>>>>
>>>>>> DIMENSION1 = LOAD 'DIM1' as (key, value);
>>>>>>
>>>>>> FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
>> key;
>>>>>>
>>>>>> FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
>>>> col1,…….
>>>>>> FACT_TABLE1::col30 as col30;
>>>>>>
>>>>>> DIMENSION2 = LOAD 'DIM2' as (key, value);
>>>>>>
>>>>>> FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
>> key;
>>>>>>
>>>>>> FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
>>>>>> col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
>>>>>>
>>>>>> & so on!  There are 10 more such dimension tables to join.
>>>>>>
>>>>>> In short, each row on the fact table needs to be joined to a key
>>> field
>>>>> on a
>>>>>> dimension table to get it's associated value.
>>>>>>
>>>>>> This is beginning to look ugly.  Plus it's maintenance nightmare
>> when
>>>> it
>>>>>> comes to adding new fields.  What's the best way to code this in
>> Pig?
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>
>>>>
>>>
>>