Pig user mailing list: Understand Schema after a Join


rob parker 2011-07-29, 16:47
Re: Understand Schema after a Join
Isn't the query parser getting confused because you've chosen the same
column name mappings for the LHS and RHS of your join?

What happens if you change this:

AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';

to:

AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';

Norbert

On Fri, Jul 29, 2011 at 12:47 PM, rob parker <[EMAIL PROTECTED]> wrote:

> Trying to join two sets and generate a set from the join, and I am
> getting an error.
>
> $ hadoop fs -cat DIM/\*
> 2011,01,31
> 2011,02,28
> 2011,03,31
> 2011,04,30
> 2011,05,31
> 2011,06,30
> 2011,07,31
> 2011,08,31
> 2011,09,30
> 2011,10,31
> 2011,11,30
> 2011,12,31
>
>
> $ hadoop fs -cat ACCT/\*
> 2011,7,26,key1,23.25,2470.0
> 2011,7,26,key2,10.416666666666668,232274.08333333334
> 2011,7,26,key3,82.83333333333333,541377.25
> 2011,7,26,key4,78.5,492823.33333333326
> 2011,7,26,key5,110.83333333333334,729811.9166666667
> 2011,7,26,key6,102.16666666666666,675941.25
> 2011,7,26,key7,118.91666666666666,770896.75
>
>
> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> days:int);
> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> day: int, account:chararray, metric1:double, metric2:double);
> grunt> AjD = JOIN ACCT BY (year,month), DIM  BY (year,month) USING
> 'replicated';
> grunt> dump AjD;
> ...
> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
> int,DIM::month: int,DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
> iterator
> for alias FINAL. Backend error : Scalar has more than one row in the
> output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>
> However if I store it and reload it to shed the "join" schema:
>
> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> month2:int, days:int);
>
> grunt> FINAL = FOREACH AjD2
>
> >> GENERATE year, month, account, (metric2 /days);
>
> grunt> dump FINAL;
> ...
> (2011,7,key1,79.6774193548387)
> (2011,7,key2,7492.712365591398)
> (2011,7,key3,17463.782258064515)
> (2011,7,key4,15897.526881720427)
> (2011,7,key5,23542.319892473122)
> (2011,7,key6,21804.5564516129)
> (2011,7,key7,24867.637096774193)
>
> What am I missing to make this work without storing and reloading?
>
> Thanks,
> Rob
>
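For what it's worth, the "Scalar has more than one row" error above most likely comes from the dot operator rather than the join itself: in a FOREACH over the joined relation, `ACCT.metric2` asks Pig to treat the relation ACCT as a scalar, which fails whenever ACCT has more than one row. A minimal sketch (not from the thread) using Pig's `::` disambiguation operator instead, with the field names taken from the `describe AjD` output above:

```pig
-- Sketch: reference joined fields with :: instead of the dot operator.
-- ACCT.metric2 projects the ACCT *relation* as a scalar; ACCT::metric2
-- projects the joined field of AjD.
FINAL = FOREACH AjD
        GENERATE ACCT::year, ACCT::month, ACCT::account,
                 (ACCT::metric2 / DIM::days);
```

Since `DIM::days` is already in scope in the join output, this would avoid the store-and-reload round trip entirely.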
Replies:
Raghu Angadi 2011-07-29, 18:00
Raghu Angadi 2011-07-29, 18:07
rob parker 2011-07-29, 19:39
Thejas Nair 2011-07-29, 20:32