|
|
-
Understand Schema after a Join
rob parker 2011-07-29, 16:47
Trying to join two sets and generate a set from the join and I am getting a $ hadoop fs -cat DIM/\* 2011,01,31 2011,02,28 2011,03,31 2011,04,30 2011,05,31 2011,06,30 2011,07,31 2011,08,31 2011,09,30 2011,10,31 2011,11,30 2011,12,31 $ hadoop fs -cat ACCT/\* 2011,7,26,key1,23.25,2470.0 2011,7,26,key2,10.416666666666668,232274.08333333334 2011,7,26,key3,82.83333333333333,541377.25 2011,7,26,key4,78.5,492823.33333333326 2011,7,26,key5,110.83333333333334,729811.9166666667 2011,7,26,key6,102.16666666666666,675941.25 2011,7,26,key7,118.91666666666666,770896.75 grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, days:int); grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, day: int, account:chararray, metric1:double, metric2:double); grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated'; grunt> dump AjD; ... (2011,7,26,key1,23.25,2470.0,2011,7,31) (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31) (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31) (2011,7,26,key4,78.5,492823.33333333326,2011,7,31) (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31) (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31) (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31) grunt> describe AjD; AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: int,DIM::month: int,DIM::days: int}
grunt> FINAL = FOREACH AjD >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); grunt> dump FINAL; ... ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias FINAL. Backend error : Scalar has more than one row in the output. 1st : (2011,7,26,key1,23.25,2470.0), 2nd :(2011,7,26,key2,10.416666666666668,232274.08333333334)
However if I store it and reload it to shed the "join" schema:
grunt> STORE AjD INTO 'AjD' using PigStorage(','); grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, day:int, account:chararray, metric1:double, metric2:double, year2:int, month2:int, days:int);
grunt> FINAL = FOREACH AjD2
>> GENERATE year, month, account, (metric2 /days);
grunt> dump FINAL; ... (2011,7,key1,79.6774193548387) (2011,7,key2,7492.712365591398) (2011,7,key3,17463.782258064515) (2011,7,key4,15897.526881720427) (2011,7,key5,23542.319892473122) (2011,7,key6,21804.5564516129) (2011,7,key7,24867.637096774193)
What am I missing to make this work without storing and reloading?
Thanks, Rob
-
Re: Understand Schema after a Join
Norbert Burger 2011-07-29, 17:35
Isn't the query parser getting confused because you've chosen the same column name mappings for the LHS and RHS of your join?
What happens if you change this:
AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';
to:
AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';
Norbert
On Fri, Jul 29, 2011 at 12:47 PM, rob parker <[EMAIL PROTECTED]> wrote:
> Trying to join two sets and generate a set from the join and I am getting a > > > $ hadoop fs -cat DIM/\* > 2011,01,31 > 2011,02,28 > 2011,03,31 > 2011,04,30 > 2011,05,31 > 2011,06,30 > 2011,07,31 > 2011,08,31 > 2011,09,30 > 2011,10,31 > 2011,11,30 > 2011,12,31 > > > $ hadoop fs -cat ACCT/\* > 2011,7,26,key1,23.25,2470.0 > 2011,7,26,key2,10.416666666666668,232274.08333333334 > 2011,7,26,key3,82.83333333333333,541377.25 > 2011,7,26,key4,78.5,492823.33333333326 > 2011,7,26,key5,110.83333333333334,729811.9166666667 > 2011,7,26,key6,102.16666666666666,675941.25 > 2011,7,26,key7,118.91666666666666,770896.75 > > > grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, > days:int); > grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, > day: int, account:chararray, metric1:double, metric2:double); > grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING > 'replicated'; > grunt> dump AjD; > ... > (2011,7,26,key1,23.25,2470.0,2011,7,31) > (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31) > (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31) > (2011,7,26,key4,78.5,492823.33333333326,2011,7,31) > (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31) > (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31) > (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31) > grunt> describe AjD; > AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: > chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: > int,DIM::month: int,DIM::days: int} > > grunt> FINAL = FOREACH AjD > >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); > grunt> dump FINAL; > ... > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open > iterator > for alias FINAL. Backend error : Scalar has more than one row in the > output. > 1st : (2011,7,26,key1,23.25,2470.0), 2nd > :(2011,7,26,key2,10.416666666666668,232274.08333333334) > > However if I store it and reload it to shed the "join" schema: > > grunt> STORE AjD INTO 'AjD' using PigStorage(','); > grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, > day:int, account:chararray, metric1:double, metric2:double, year2:int, > month2:int, days:int); > > grunt> FINAL = FOREACH AjD2 > > >> GENERATE year, month, account, (metric2 /days); > > grunt> dump FINAL; > ... > (2011,7,key1,79.6774193548387) > (2011,7,key2,7492.712365591398) > (2011,7,key3,17463.782258064515) > (2011,7,key4,15897.526881720427) > (2011,7,key5,23542.319892473122) > (2011,7,key6,21804.5564516129) > (2011,7,key7,24867.637096774193) > > What am I missing to make this work without storing and reloading? > > Thanks, > Rob >
-
Re: Understand Schema after a Join
Raghu Angadi 2011-07-29, 18:00
> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
should be GENERATE ACCT::year, ACCT::month ... etc.
this is a common mistake to use '.' instead of '::'.. I wish the error message is more user friendly.. PIG supports 'scalars' and assumes your ACCT would be a single row table at runtime when access fields like ACCT.year. On Fri, Jul 29, 2011 at 9:47 AM, rob parker <[EMAIL PROTECTED]> wrote:
> Trying to join two sets and generate a set from the join and I am getting a > > > $ hadoop fs -cat DIM/\* > 2011,01,31 > 2011,02,28 > 2011,03,31 > 2011,04,30 > 2011,05,31 > 2011,06,30 > 2011,07,31 > 2011,08,31 > 2011,09,30 > 2011,10,31 > 2011,11,30 > 2011,12,31 > > > $ hadoop fs -cat ACCT/\* > 2011,7,26,key1,23.25,2470.0 > 2011,7,26,key2,10.416666666666668,232274.08333333334 > 2011,7,26,key3,82.83333333333333,541377.25 > 2011,7,26,key4,78.5,492823.33333333326 > 2011,7,26,key5,110.83333333333334,729811.9166666667 > 2011,7,26,key6,102.16666666666666,675941.25 > 2011,7,26,key7,118.91666666666666,770896.75 > > > grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, > days:int); > grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, > day: int, account:chararray, metric1:double, metric2:double); > grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING > 'replicated'; > grunt> dump AjD; > ... > (2011,7,26,key1,23.25,2470.0,2011,7,31) > (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31) > (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31) > (2011,7,26,key4,78.5,492823.33333333326,2011,7,31) > (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31) > (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31) > (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31) > grunt> describe AjD; > AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: > chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: > int,DIM::month: int,DIM::days: int} > > grunt> FINAL = FOREACH AjD > >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); > grunt> dump FINAL; > ... > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open > iterator > for alias FINAL. Backend error : Scalar has more than one row in the > output. > 1st : (2011,7,26,key1,23.25,2470.0), 2nd > :(2011,7,26,key2,10.416666666666668,232274.08333333334) > > However if I store it and reload it to shed the "join" schema: > > grunt> STORE AjD INTO 'AjD' using PigStorage(','); > grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, > day:int, account:chararray, metric1:double, metric2:double, year2:int, > month2:int, days:int); > > grunt> FINAL = FOREACH AjD2 > > >> GENERATE year, month, account, (metric2 /days); > > grunt> dump FINAL; > ... > (2011,7,key1,79.6774193548387) > (2011,7,key2,7492.712365591398) > (2011,7,key3,17463.782258064515) > (2011,7,key4,15897.526881720427) > (2011,7,key5,23542.319892473122) > (2011,7,key6,21804.5564516129) > (2011,7,key7,24867.637096774193) > > What am I missing to make this work without storing and reloading? > > Thanks, > Rob >
-
Re: Understand Schema after a Join
Raghu Angadi 2011-07-29, 18:07
Is implicit scalar conversion going to stay in PIG? My preference would to make it explicit like SCALAR(ACCT.year)..
On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); > > should be GENERATE ACCT::year, ACCT::month ... etc. > > this is a common mistake to use '.' instead of '::'.. I wish the error > message is more user friendly.. > PIG supports 'scalars' and assumes your ACCT would be a single row table at > runtime when access fields like ACCT.year. > > > On Fri, Jul 29, 2011 at 9:47 AM, rob parker <[EMAIL PROTECTED]>wrote: > >> Trying to join two sets and generate a set from the join and I am getting >> a >> >> >> $ hadoop fs -cat DIM/\* >> 2011,01,31 >> 2011,02,28 >> 2011,03,31 >> 2011,04,30 >> 2011,05,31 >> 2011,06,30 >> 2011,07,31 >> 2011,08,31 >> 2011,09,30 >> 2011,10,31 >> 2011,11,30 >> 2011,12,31 >> >> >> $ hadoop fs -cat ACCT/\* >> 2011,7,26,key1,23.25,2470.0 >> 2011,7,26,key2,10.416666666666668,232274.08333333334 >> 2011,7,26,key3,82.83333333333333,541377.25 >> 2011,7,26,key4,78.5,492823.33333333326 >> 2011,7,26,key5,110.83333333333334,729811.9166666667 >> 2011,7,26,key6,102.16666666666666,675941.25 >> 2011,7,26,key7,118.91666666666666,770896.75 >> >> >> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, >> days:int); >> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, >> day: int, account:chararray, metric1:double, metric2:double); >> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING >> 'replicated'; >> grunt> dump AjD; >> ... >> (2011,7,26,key1,23.25,2470.0,2011,7,31) >> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31) >> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31) >> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31) >> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31) >> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31) >> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31) >> grunt> describe AjD; >> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: >> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: >> int,DIM::month: int,DIM::days: int} >> >> grunt> FINAL = FOREACH AjD >> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / >> DIM.days); >> grunt> dump FINAL; >> ... >> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open >> iterator >> for alias FINAL. Backend error : Scalar has more than one row in the >> output. >> 1st : (2011,7,26,key1,23.25,2470.0), 2nd >> :(2011,7,26,key2,10.416666666666668,232274.08333333334) >> >> However if I store it and reload it to shed the "join" schema: >> >> grunt> STORE AjD INTO 'AjD' using PigStorage(','); >> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, >> day:int, account:chararray, metric1:double, metric2:double, year2:int, >> month2:int, days:int); >> >> grunt> FINAL = FOREACH AjD2 >> >> >> GENERATE year, month, account, (metric2 /days); >> >> grunt> dump FINAL; >> ... >> (2011,7,key1,79.6774193548387) >> (2011,7,key2,7492.712365591398) >> (2011,7,key3,17463.782258064515) >> (2011,7,key4,15897.526881720427) >> (2011,7,key5,23542.319892473122) >> (2011,7,key6,21804.5564516129) >> (2011,7,key7,24867.637096774193) >> >> What am I missing to make this work without storing and reloading? >> >> Thanks, >> Rob >> > >
-
Re: Understand Schema after a Join
rob parker 2011-07-29, 19:39
Thanks!
Rob
On Fri, Jul 29, 2011 at 11:07 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> Is implicit scalar conversion going to stay in PIG? My preference would to > make it explicit like SCALAR(ACCT.year).. > > On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote: > > > > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / > DIM.days); > > > > should be GENERATE ACCT::year, ACCT::month ... etc. > > > > this is a common mistake to use '.' instead of '::'.. I wish the error > > message is more user friendly.. > > PIG supports 'scalars' and assumes your ACCT would be a single row table > at > > runtime when access fields like ACCT.year. > > > > > > On Fri, Jul 29, 2011 at 9:47 AM, rob parker <[EMAIL PROTECTED] > >wrote: > > > >> Trying to join two sets and generate a set from the join and I am > getting > >> a > >> > >> > >> $ hadoop fs -cat DIM/\* > >> 2011,01,31 > >> 2011,02,28 > >> 2011,03,31 > >> 2011,04,30 > >> 2011,05,31 > >> 2011,06,30 > >> 2011,07,31 > >> 2011,08,31 > >> 2011,09,30 > >> 2011,10,31 > >> 2011,11,30 > >> 2011,12,31 > >> > >> > >> $ hadoop fs -cat ACCT/\* > >> 2011,7,26,key1,23.25,2470.0 > >> 2011,7,26,key2,10.416666666666668,232274.08333333334 > >> 2011,7,26,key3,82.83333333333333,541377.25 > >> 2011,7,26,key4,78.5,492823.33333333326 > >> 2011,7,26,key5,110.83333333333334,729811.9166666667 > >> 2011,7,26,key6,102.16666666666666,675941.25 > >> 2011,7,26,key7,118.91666666666666,770896.75 > >> > >> > >> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, > >> days:int); > >> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, > >> day: int, account:chararray, metric1:double, metric2:double); > >> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING > >> 'replicated'; > >> grunt> dump AjD; > >> ... > >> (2011,7,26,key1,23.25,2470.0,2011,7,31) > >> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31) > >> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31) > >> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31) > >> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31) > >> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31) > >> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31) > >> grunt> describe AjD; > >> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: > >> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: > >> int,DIM::month: int,DIM::days: int} > >> > >> grunt> FINAL = FOREACH AjD > >> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / > >> DIM.days); > >> grunt> dump FINAL; > >> ... > >> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open > >> iterator > >> for alias FINAL. Backend error : Scalar has more than one row in the > >> output. > >> 1st : (2011,7,26,key1,23.25,2470.0), 2nd > >> :(2011,7,26,key2,10.416666666666668,232274.08333333334) > >> > >> However if I store it and reload it to shed the "join" schema: > >> > >> grunt> STORE AjD INTO 'AjD' using PigStorage(','); > >> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, > >> day:int, account:chararray, metric1:double, metric2:double, year2:int, > >> month2:int, days:int); > >> > >> grunt> FINAL = FOREACH AjD2 > >> > >> >> GENERATE year, month, account, (metric2 /days); > >> > >> grunt> dump FINAL; > >> ... > >> (2011,7,key1,79.6774193548387) > >> (2011,7,key2,7492.712365591398) > >> (2011,7,key3,17463.782258064515) > >> (2011,7,key4,15897.526881720427) > >> (2011,7,key5,23542.319892473122) > >> (2011,7,key6,21804.5564516129) > >> (2011,7,key7,24867.637096774193) > >> > >> What am I missing to make this work without storing and reloading? > >> > >> Thanks, > >> Rob > >> > > > > >
-
Re: Understand Schema after a Join
Thejas Nair 2011-07-29, 20:32
On 7/29/11 11:07 AM, Raghu Angadi wrote: > Is implicit scalar conversion going to stay in PIG? My preference would to > make it explicit like SCALAR(ACCT.year).. > That's my preference as well , you can vote/ submit a patch ! - https://issues.apache.org/jira/browse/PIG-1967-Thejas
-
Re: Understand Schema after a Join
Thejas Nair 2011-07-29, 20:32
On 7/29/11 9:47 AM, rob parker wrote: > grunt> describe AjD; > AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: > chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: > int,DIM::month: int,DIM::days: int} > > grunt> FINAL = FOREACH AjD >>> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); > grunt> dump FINAL; > ... > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator > for alias FINAL. Backend error : Scalar has more than one row in the output. > 1st : (2011,7,26,key1,23.25,2470.0), 2nd > :(2011,7,26,key2,10.416666666666668,232274.08333333334) The column in AjD schema is ACCT::year, not ACCT.year (note the "::" vs "."). So you need to change it to - grunt> FINAL = FOREACH AjD GENERATE ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 / DIM::days); You are accidentally using the relation-as-scalar feature here. http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+ScalarsThere is a jira open to deprecate the syntax for this feature so that users don't accidentally end up using it - https://issues.apache.org/jira/browse/PIG-1967-Thejas
|
|