Pig >> mail # user >> Understand Schema after a Join


Re: Understand Schema after a Join
Is implicit scalar conversion going to stay in Pig? My preference would be to
make it explicit, e.g. SCALAR(ACCT.year).
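For context, Pig (0.8 and later) already performs this conversion implicitly when a relation is used in a field position; the SCALAR(...) syntax proposed above is hypothetical. A minimal sketch of the implicit form, assuming a relation A with a numeric field val:

```pig
-- Reduce A to a single-row relation holding a count.
G = GROUP A ALL;
TOTALS = FOREACH G GENERATE COUNT(A) AS cnt;   -- exactly one row

-- Implicit scalar: TOTALS.cnt is read as a single value at runtime.
-- This fails with "Scalar has more than one row" if TOTALS ever
-- contains more than one row.
B = FOREACH A GENERATE val / (double) TOTALS.cnt;
```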

On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:

> > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
>
> should be GENERATE ACCT::year, ACCT::month ... etc.
>
> This is a common mistake: using '.' instead of '::'. I wish the error
> message were more user friendly.
> Pig supports 'scalars' and assumes ACCT is a single-row relation at
> runtime when you access fields like ACCT.year.
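Applying the advice above to the script quoted below, the corrected FOREACH would be (relation and field names taken from the thread):

```pig
-- '::' disambiguates the post-join fields of AjD directly,
-- so no store-and-reload round trip is needed.
FINAL = FOREACH AjD
        GENERATE ACCT::year, ACCT::month, ACCT::account,
                 (ACCT::metric2 / DIM::days);
```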
>
>
> On Fri, Jul 29, 2011 at 9:47 AM, rob parker <[EMAIL PROTECTED]> wrote:
>
>> Trying to join two sets and generate a set from the join, and I am
>> getting an error:
>>
>>
>> $ hadoop fs -cat DIM/\*
>> 2011,01,31
>> 2011,02,28
>> 2011,03,31
>> 2011,04,30
>> 2011,05,31
>> 2011,06,30
>> 2011,07,31
>> 2011,08,31
>> 2011,09,30
>> 2011,10,31
>> 2011,11,30
>> 2011,12,31
>>
>>
>> $ hadoop fs -cat ACCT/\*
>> 2011,7,26,key1,23.25,2470.0
>> 2011,7,26,key2,10.416666666666668,232274.08333333334
>> 2011,7,26,key3,82.83333333333333,541377.25
>> 2011,7,26,key4,78.5,492823.33333333326
>> 2011,7,26,key5,110.83333333333334,729811.9166666667
>> 2011,7,26,key6,102.16666666666666,675941.25
>> 2011,7,26,key7,118.91666666666666,770896.75
>>
>>
>> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
>> days:int);
>> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
>> day: int, account:chararray, metric1:double, metric2:double);
>> grunt> AjD = JOIN ACCT BY (year,month), DIM  BY (year,month) USING
>> 'replicated';
>> grunt> dump AjD;
>> ...
>> (2011,7,26,key1,23.25,2470.0,2011,7,31)
>> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
>> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
>> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
>> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
>> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
>> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
>> grunt> describe AjD;
>> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
>> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
>> int,DIM::month: int,DIM::days: int}
>>
>> grunt> FINAL = FOREACH AjD
>> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
>> DIM.days);
>> grunt> dump FINAL;
>> ...
>> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
>> iterator
>> for alias FINAL. Backend error : Scalar has more than one row in the
>> output.
>> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
>> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>>
>> However if I store it and reload it to shed the "join" schema:
>>
>> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
>> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
>> day:int, account:chararray, metric1:double, metric2:double, year2:int,
>> month2:int, days:int);
>>
>> grunt> FINAL = FOREACH AjD2
>>
>> >> GENERATE year, month, account, (metric2 /days);
>>
>> grunt> dump FINAL;
>> ...
>> (2011,7,key1,79.6774193548387)
>> (2011,7,key2,7492.712365591398)
>> (2011,7,key3,17463.782258064515)
>> (2011,7,key4,15897.526881720427)
>> (2011,7,key5,23542.319892473122)
>> (2011,7,key6,21804.5564516129)
>> (2011,7,key7,24867.637096774193)
>>
>> What am I missing to make this work without storing and reloading?
>>
>> Thanks,
>> Rob
>>
>
>