Pig, mail # user - PIG with -tagsource option behaves weird


Re: PIG with -tagsource option behaves weird
Prabu Dhakshinamurthy 2013-02-04, 21:07
I found in another message that starting Pig with the flag
'-t ColumnMapKeyPrune', which turns off the ColumnMapKeyPrune optimizer rule,
helps fix this issue, i.e., start Pig using the command:
pig -x local -t ColumnMapKeyPrune sample.pig
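
To check whether the workaround took effect, it may help to know what B is supposed to contain. The sketch below is plain Python, not Pig. It applies the same projection as the script's B statement (ip, first character of domain, filename) to the rows from the Dump of A quoted further down. The names A, project_b and expected_b are made up for illustration, and the answer field is kept as a plain string.

```python
# Rows copied from "Dump of A" in the quoted message.
# Field order: (ts, ip, domain, answer, domain_index, filename)
A = [
    (100, "123.98.11.123", "google.com", "{(google)}", "20121201_G", "20121201"),
    (95, "500.98.11.123", "yahoo.com", "{(yahoo)}", "20121201_Y", "20121201"),
    (107, "123.98.11.123", "google.com", "{(google)}", "20121201_G", "20121201"),
    (156, "123.98.11.123", "cnn.com", "{(cnn)}", "20121201_C", "20121201"),
    (100, "500.98.11.123", "ndtv.com", "{(ndtv)}", "20121201_N", "20121201"),
    (200, "123.98.11.123", "google.com", "{(google)}", "20121202_G", "20121202"),
    (283, "500.98.11.123", "yahoo.com", "{(yahoo)}", "20121202_Y", "20121202"),
    (283, "500.98.11.123", "pinterest.com", "{(pinterest)}", "20121202_P", "20121202"),
    (204, "600.10.100.221", "bbc.com", "{(bbc)}", "20121202_B", "20121202"),
]


def project_b(row):
    """Mimic: B = foreach A generate ip, SUBSTRING(domain, 0, 1), filename;"""
    ts, ip, domain, answer, domain_index, filename = row
    # SUBSTRING(domain, 0, 1) keeps only the first character of the domain.
    return (ip, domain[0:1], filename)


# What the projection is meant to produce: ip first, never ts.
expected_b = [project_b(row) for row in A]

for row in expected_b:
    print(row)
```

Comparing these tuples with the output of dump B after applying the workaround should show whether ip, rather than ts, now comes first.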

On Sun, Feb 3, 2013 at 12:17 PM, Prabu Dhakshinamurthy
<[EMAIL PROTECTED]> wrote:
> Dump of A:
> (100,123.98.11.123,google.com,{(google)},20121201_G,20121201)
> (95,500.98.11.123,yahoo.com,{(yahoo)},20121201_Y,20121201)
> (107,123.98.11.123,google.com,{(google)},20121201_G,20121201)
> (156,123.98.11.123,cnn.com,{(cnn)},20121201_C,20121201)
> (100,500.98.11.123,ndtv.com,{(ndtv)},20121201_N,20121201)
> (200,123.98.11.123,google.com,{(google)},20121202_G,20121202)
> (283,500.98.11.123,yahoo.com,{(yahoo)},20121202_Y,20121202)
> (283,500.98.11.123,pinterest.com,{(pinterest)},20121202_P,20121202)
> (204,600.10.100.221,bbc.com,{(bbc)},20121202_B,20121202)
>
>
> Dump of B:
> (100,g,20121201)
> (95,y,20121201)
> (107,g,20121201)
> (156,c,20121201)
> (100,n,20121201)
> (200,g,20121202)
> (283,y,20121202)
> (283,p,20121202)
> (204,b,20121202)
>
> ILLUSTRATE B:
>
> | B | ip:chararray  | domain_first_char:chararray | filename:chararray |
> |   | 123.98.11.123 | g                           | 20121202           |
>
> As seen in Dump B, instead of printing the ip value as the first field (as
> in illustrate B), it prints the ts field.
>
>
> On Sun, Feb 3, 2013 at 11:56 AM, Prabu Dhakshinamurthy
> <[EMAIL PROTECTED]> wrote:
>>
>> I am using the -tagsource option while loading the input data in order to
>> identify the input source. Later, when I project only selected fields from
>> the input tuple, certain fields are projected every time even though I try
>> to leave them out.
>>
>> Take a look at my script.
>>
>> rawdata = load 'data/201212*' using PigStorage(' ', '-tagsource')
>>     as (filename:chararray, ts:int, ip:chararray, domain:chararray, answer:chararray);
>>
>> A = foreach rawdata generate ts, ip, domain, answer,
>>     CONCAT(CONCAT(filename, '_'), UPPER(SUBSTRING(domain, 0, 1))) as domain_index,
>>     filename as filename;
>> B = foreach A generate ip as ip, SUBSTRING(domain, 0, 1) as domain_first_char, filename;
>> dump A;
>> dump B;
>> ILLUSTRATE B;
>>
>> While creating B, I am trying to include only selected fields from A.
>> However, when I dump B, the 'ts' field (the first field in A) appears in
>> place of 'ip'. In ILLUSTRATE B, the output looks as expected.
>>
>> I appreciate any help. Thanks!
>>
>> --
>>
>> Prabu D
>
>
>
>
> --
>
> Prabu Dhakshinamurthy
> Graduate student | CSE | UCSD

--
Prabu Dhakshinamurthy
Graduate student | CSE | UCSD