Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Understand Schema after a Join


Copy link to this message
-
Re: Understand Schema after a Join
Raghu Angadi 2011-07-29, 18:07
Is implicit scalar conversion going to stay in PIG? My preference would to
make it explicit like SCALAR(ACCT.year)..

On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:

> > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
>
> should be GENERATE ACCT::year, ACCT::month ... etc.
>
> this is a common mistake to use '.' instead of '::'.. I wish the error
> message is more user friendly..
> PIG supports 'scalars' and assumes your ACCT would be a single row table at
> runtime when access fields like ACCT.year.
>
>
> On Fri, Jul 29, 2011 at 9:47 AM, rob parker <[EMAIL PROTECTED]>wrote:
>
>> Trying to join two sets and generate a set from the join and I am getting
>> a
>>
>>
>> $ hadoop fs -cat DIM/\*
>> 2011,01,31
>> 2011,02,28
>> 2011,03,31
>> 2011,04,30
>> 2011,05,31
>> 2011,06,30
>> 2011,07,31
>> 2011,08,31
>> 2011,09,30
>> 2011,10,31
>> 2011,11,30
>> 2011,12,31
>>
>>
>> $ hadoop fs -cat ACCT/\*
>> 2011,7,26,key1,23.25,2470.0
>> 2011,7,26,key2,10.416666666666668,232274.08333333334
>> 2011,7,26,key3,82.83333333333333,541377.25
>> 2011,7,26,key4,78.5,492823.33333333326
>> 2011,7,26,key5,110.83333333333334,729811.9166666667
>> 2011,7,26,key6,102.16666666666666,675941.25
>> 2011,7,26,key7,118.91666666666666,770896.75
>>
>>
>> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
>> days:int);
>> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
>> day: int, account:chararray, metric1:double, metric2:double);
>> grunt> AjD = JOIN ACCT BY (year,month), DIM  BY (year,month) USING
>> 'replicated';
>> grunt> dump AjD;
>> ...
>> (2011,7,26,key1,23.25,2470.0,2011,7,31)
>> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
>> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
>> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
>> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
>> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
>> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
>> grunt> describe AjD;
>> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
>> chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
>> int,DIM::month: int,DIM::days: int}
>>
>> grunt> FINAL = FOREACH AjD
>> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
>> DIM.days);
>> grunt> dump FINAL;
>> ...
>> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
>> iterator
>> for alias FINAL. Backend error : Scalar has more than one row in the
>> output.
>> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
>> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>>
>> However if I store it and reload it to shed the "join" schema:
>>
>> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
>> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
>> day:int, account:chararray, metric1:double, metric2:double, year2:int,
>> month2:int, days:int);
>>
>> grunt> FINAL = FOREACH AjD2
>>
>> >> GENERATE year, month, account, (metric2 /days);
>>
>> grunt> dump FINAL;
>> ...
>> (2011,7,key1,79.6774193548387)
>> (2011,7,key2,7492.712365591398)
>> (2011,7,key3,17463.782258064515)
>> (2011,7,key4,15897.526881720427)
>> (2011,7,key5,23542.319892473122)
>> (2011,7,key6,21804.5564516129)
>> (2011,7,key7,24867.637096774193)
>>
>> What am I missing to make this work without storing and reloading?
>>
>> Thanks,
>> Rob
>>
>
>