Pig >> mail # user >> PIG with -tagsource option behaves weird


Re: PIG with -tagsource option behaves weird
I found in another message that starting Pig with the flag
'-t ColumnMapKeyPrune', which turns off the ColumnMapKeyPrune optimization,
works around this issue. That is, start Pig using the command:
pig -x local -t ColumnMapKeyPrune sample.pig
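
For reference, the output that B was expected to produce can be sketched outside Pig. The snippet below is only an illustration in Python, not Pig code: it takes three of the tuples from the dump of A quoted below and applies the same projection as the script (ip, first character of domain, filename). The names `A`, `project_b` and `B` are hypothetical helpers for this sketch, and the `answer` bag is stored as a plain string.

```python
# Three tuples copied from the "Dump of A" output quoted below:
# (ts, ip, domain, answer, domain_index, filename)
A = [
    (100, "123.98.11.123", "google.com", "{(google)}", "20121201_G", "20121201"),
    (95, "500.98.11.123", "yahoo.com", "{(yahoo)}", "20121201_Y", "20121201"),
    (204, "600.10.100.221", "bbc.com", "{(bbc)}", "20121202_B", "20121202"),
]


def project_b(rows):
    """Mimic: B = foreach A generate ip, SUBSTRING(domain, 0, 1), filename."""
    return [
        (ip, domain[0:1], filename)
        for (_ts, ip, domain, _answer, _domain_index, filename) in rows
    ]


B = project_b(A)
for row in B:
    print(row)
```

With the projection applied this way, the first field of every tuple in B is the ip value, which matches what ILLUSTRATE B reports rather than what dump B printed.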

On Sun, Feb 3, 2013 at 12:17 PM, Prabu Dhakshinamurthy
<[EMAIL PROTECTED]> wrote:
> Dump of A:
> (100,123.98.11.123,google.com,{(google)},20121201_G,20121201)
> (95,500.98.11.123,yahoo.com,{(yahoo)},20121201_Y,20121201)
> (107,123.98.11.123,google.com,{(google)},20121201_G,20121201)
> (156,123.98.11.123,cnn.com,{(cnn)},20121201_C,20121201)
> (100,500.98.11.123,ndtv.com,{(ndtv)},20121201_N,20121201)
> (200,123.98.11.123,google.com,{(google)},20121202_G,20121202)
> (283,500.98.11.123,yahoo.com,{(yahoo)},20121202_Y,20121202)
> (283,500.98.11.123,pinterest.com,{(pinterest)},20121202_P,20121202)
> (204,600.10.100.221,bbc.com,{(bbc)},20121202_B,20121202)
>
>
> Dump of B:
> (100,g,20121201)
> (95,y,20121201)
> (107,g,20121201)
> (156,c,20121201)
> (100,n,20121201)
> (200,g,20121202)
> (283,y,20121202)
> (283,p,20121202)
> (204,b,20121202)
>
> ILLUSTRATE B:
>
> | B     | ip:chararray  | domain_first_char:chararray | filename:chararray
> |       | 123.98.11.123 | g                           | 20121202
>
> As seen in Dump B, instead of printing the ip value as the first field (as
> in illustrate B), it prints the ts field.
>
>
> On Sun, Feb 3, 2013 at 11:56 AM, Prabu Dhakshinamurthy
> <[EMAIL PROTECTED]> wrote:
>>
>> I am using the -tagsource option while loading the input data in order to
>> identify the input source. However, when I later project only selected
>> fields from the input tuple, certain fields are always projected even
>> though I try to leave them out.
>>
>> Take a look at my script.
>>
>> rawdata = load 'data/201212*' using PigStorage(' ', '-tagsource') as
>>     (filename:chararray, ts:int, ip:chararray, domain:chararray, answer:chararray);
>>
>> A = foreach rawdata generate ts, ip, domain, answer,
>>     CONCAT(CONCAT(filename, '_'), UPPER(SUBSTRING(domain, 0, 1))) as domain_index,
>>     filename as filename;
>> B = foreach A generate ip as ip, SUBSTRING(domain, 0, 1) as domain_first_char,
>>     filename;
>> dump A;
>> dump B;
>> ILLUSTRATE B;
>>
>> While creating B, I am trying to include only selected fields from A.
>> However, when I dump B, the 'ts' field (the first field in A) appears in
>> place of 'ip'. In ILLUSTRATE B, everything looks as expected.
>>
>> I appreciate any help. Thanks!
>>
>> --
>>
>> Prabu D
>
>
>
>
> --
>
> Prabu Dhakshinamurthy
> Graduate student | CSE | UCSD

--
Prabu Dhakshinamurthy
Graduate student | CSE | UCSD