-
PIG with -tagsource option behaves weird
Prabu Dhakshinamurthy 2013-02-03, 19:56
I am using the -tagsource option when loading the input data in order to identify the input source. It seems that when I later project only selected fields from the input tuple, certain fields are always projected even though I try to leave them out.
Take a look at my script.
rawdata = load 'data/201212*' using PigStorage(' ', '-tagsource') as (filename:chararray, ts: int, ip: chararray, domain: chararray, answer: chararray);
A = foreach rawdata generate ts, ip, domain, answer, CONCAT(CONCAT(filename, '_'), UPPER(SUBSTRING(domain, 0, 1))) as domain_index, filename as filename;
B = foreach A generate ip as ip, SUBSTRING(domain, 0, 1) as domain_first_char, filename;
dump A;
dump B;
ILLUSTRATE B;
While creating B, I am trying to include only selected fields from A. However, if I dump B, the 'ts' field (the first field in A) keeps appearing in B. But in ILLUSTRATE B, everything looks nice as expected.
I appreciate any help. Thanks!
--
Prabu D
-
Re: PIG with -tagsource option behaves weird
Prabu Dhakshinamurthy 2013-02-03, 20:17
Dump of A:
(100,123.98.11.123,google.com,{(google)},20121201_G,20121201)
(95,500.98.11.123,yahoo.com,{(yahoo)},20121201_Y,20121201)
(107,123.98.11.123,google.com,{(google)},20121201_G,20121201)
(156,123.98.11.123,cnn.com,{(cnn)},20121201_C,20121201)
(100,500.98.11.123,ndtv.com,{(ndtv)},20121201_N,20121201)
(200,123.98.11.123,google.com,{(google)},20121202_G,20121202)
(283,500.98.11.123,yahoo.com,{(yahoo)},20121202_Y,20121202)
(283,500.98.11.123,pinterest.com,{(pinterest)},20121202_P,20121202)
(204,600.10.100.221,bbc.com,{(bbc)},20121202_B,20121202)

Dump of B:
(100,g,20121201)
(95,y,20121201)
(107,g,20121201)
(156,c,20121201)
(100,n,20121201)
(200,g,20121202)
(283,y,20121202)
(283,p,20121202)
(204,b,20121202)
ILLUSTRATE B:
| B | ip:chararray  | domain_first_char:chararray | filename:chararray |
|   | 123.98.11.123 | g                           | 20121202           |
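The gap between the two outputs can be checked mechanically. The following is a minimal Python sketch, not part of the original script: it copies the rows from the dumps in this thread, applies the projection that B is meant to perform, and reports which positions differ. The names `project_b` and `mismatched_positions` are made up for illustration, and the `answer` bag is kept as a plain string.

```python
# Rows copied from "Dump of A" in this thread:
# (ts, ip, domain, answer, domain_index, filename)
dump_a = [
    (100, "123.98.11.123", "google.com", "{(google)}", "20121201_G", "20121201"),
    (95, "500.98.11.123", "yahoo.com", "{(yahoo)}", "20121201_Y", "20121201"),
    (107, "123.98.11.123", "google.com", "{(google)}", "20121201_G", "20121201"),
    (156, "123.98.11.123", "cnn.com", "{(cnn)}", "20121201_C", "20121201"),
    (100, "500.98.11.123", "ndtv.com", "{(ndtv)}", "20121201_N", "20121201"),
    (200, "123.98.11.123", "google.com", "{(google)}", "20121202_G", "20121202"),
    (283, "500.98.11.123", "yahoo.com", "{(yahoo)}", "20121202_Y", "20121202"),
    (283, "500.98.11.123", "pinterest.com", "{(pinterest)}", "20121202_P", "20121202"),
    (204, "600.10.100.221", "bbc.com", "{(bbc)}", "20121202_B", "20121202"),
]

# Rows copied from "Dump of B" in this thread.
dump_b = [
    (100, "g", "20121201"),
    (95, "y", "20121201"),
    (107, "g", "20121201"),
    (156, "c", "20121201"),
    (100, "n", "20121201"),
    (200, "g", "20121202"),
    (283, "y", "20121202"),
    (283, "p", "20121202"),
    (204, "b", "20121202"),
]


def project_b(row):
    """Hypothetical helper: the tuple that
    'B = foreach A generate ip, SUBSTRING(domain, 0, 1), filename' is meant to produce."""
    ts, ip, domain, answer, domain_index, filename = row
    return (ip, domain[0:1], filename)


def mismatched_positions(expected, observed):
    """Hypothetical helper: indices at which two tuples of equal length differ."""
    return [i for i, (e, o) in enumerate(zip(expected, observed)) if e != o]


expected_b = [project_b(row) for row in dump_a]

for a_row, exp, got in zip(dump_a, expected_b, dump_b):
    # Print the intended tuple, the dumped tuple, and the differing positions.
    print(exp, got, mismatched_positions(exp, got))
```

Each row differs only in its first field, and the value found there is the row's ts value from A, which matches the behaviour described in this message.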
As seen in Dump B, instead of printing the ip value as the first field (as in ILLUSTRATE B), it prints the ts field.
--
Prabu Dhakshinamurthy Graduate student | CSE | UCSD
-
Re: PIG with -tagsource option behaves weird
Prabu Dhakshinamurthy 2013-02-04, 21:07
I found in another message that starting Pig with the flag '-t ColumnMapKeyPrune' helps fix this issue, i.e., start Pig using the command pig -x local -t ColumnMapKeyPrune sample.pig.
--
Prabu Dhakshinamurthy Graduate student | CSE | UCSD