Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Understand Schema after a Join


Copy link to this message
-
Re: Understand Schema after a Join
Norbert Burger 2011-07-29, 17:35
Isn't the query parser getting confused because you've chosen the same
column name mappings for the LHS and RHS of your join?

What happens if you change this:

AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';

to:

AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';

Norbert

On Fri, Jul 29, 2011 at 12:47 PM, rob parker <[EMAIL PROTECTED]> wrote:

> Trying to join two sets and generate a set from the join and I am getting a
>
>
> $ hadoop fs -cat DIM/\*
> 2011,01,31
> 2011,02,28
> 2011,03,31
> 2011,04,30
> 2011,05,31
> 2011,06,30
> 2011,07,31
> 2011,08,31
> 2011,09,30
> 2011,10,31
> 2011,11,30
> 2011,12,31
>
>
> $ hadoop fs -cat ACCT/\*
> 2011,7,26,key1,23.25,2470.0
> 2011,7,26,key2,10.416666666666668,232274.08333333334
> 2011,7,26,key3,82.83333333333333,541377.25
> 2011,7,26,key4,78.5,492823.33333333326
> 2011,7,26,key5,110.83333333333334,729811.9166666667
> 2011,7,26,key6,102.16666666666666,675941.25
> 2011,7,26,key7,118.91666666666666,770896.75
>
>
> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> days:int);
> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> day: int, account:chararray, metric1:double, metric2:double);
> grunt> AjD = JOIN ACCT BY (year,month), DIM  BY (year,month) USING
> 'replicated';
> grunt> dump AjD;
> ...
> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> int,DIM::month: int,DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
> iterator
> for alias FINAL. Backend error : Scalar has more than one row in the
> output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>
> However if I store it and reload it to shed the "join" schema:
>
> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> month2:int, days:int);
>
> grunt> FINAL = FOREACH AjD2
>
> >> GENERATE year, month, account, (metric2 /days);
>
> grunt> dump FINAL;
> ...
> (2011,7,key1,79.6774193548387)
> (2011,7,key2,7492.712365591398)
> (2011,7,key3,17463.782258064515)
> (2011,7,key4,15897.526881720427)
> (2011,7,key5,23542.319892473122)
> (2011,7,key6,21804.5564516129)
> (2011,7,key7,24867.637096774193)
>
> What am I missing to make this work without storing and reloading?
>
> Thanks,
> Rob
>