Re: Pig script to convert Categorical variables
Interesting problem. Why not do it in two steps? First, read in the
data and group on the column you care about. Then run a FOREACH ...
GENERATE over the groups so that only the distinct values for that
column are left (in Pig terms, a GROUP ... BY on the column followed by
generating the group key).

Once you have that, convert it to a tuple, and then write a quick UDF
that goes through the ORIGINAL data set. For each row it takes two
parameters: the row's value for the column you care about and the
distinct-values tuple you just created. It returns a tuple of 0s with a
single 1, where the 1 sits at the position in the distinct-values tuple
that matches the row's value. You could write that UDF in Java, Python,
or one of the other supported UDF languages, depending on your
requirements.
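
The core logic of such a UDF can be sketched in plain Python. This is only a minimal sketch: the names one_hot, cities and rows are made up for illustration, and it assumes the distinct values arrive as an ordered tuple. Registering it with Pig (for example as a Jython UDF with an output schema) is left out here.

```python
def one_hot(value, distinct_values):
    """Return a tuple of 0s with a 1 where `value` matches an entry
    of `distinct_values`. A value that is not present yields all 0s."""
    return tuple(1 if value == candidate else 0 for candidate in distinct_values)


# Hypothetical sample data mirroring the City example from the question.
cities = ("A", "B", "C")
rows = ["A", "B", "C", "A", "C", "C"]

encoded = [one_hot(city, cities) for city in rows]
for city, flags in zip(rows, encoded):
    print(city, flags)
```

Because the position of the 1 follows the order of the distinct-values tuple, that tuple needs a stable order (sorted, for instance) if the output columns are meant to be comparable between runs.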

For passing in the parameters, you could either use a simple bash
script (your use case is simple enough, I think), or embed the Pig
script in Java, Python, or one of the other languages supported for
embedding, so it's easy to expand later if you need to. I'm personally
partial to Python and have had great results embedding in that. Just
make sure you're on Pig 0.9.1+.
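
As a rough illustration of why a host language helps here, the sketch below builds the Pig Latin text from a list of distinct values, so the number of output columns no longer has to be hard-coded. The function name build_encoding_script and its parameters are hypothetical, and the sketch assumes every value is a valid Pig alias and contains no quote characters. In an embedded setup the resulting string would be handed to Pig's scripting API rather than printed.

```python
def build_encoding_script(input_path, column, values):
    """Build Pig Latin that emits one 0/1 column per distinct value.

    Assumes each value is usable as a Pig alias and needs no escaping.
    """
    flags = ", ".join(
        "({col} == '{val}' ? 1 : 0) AS {val}".format(col=column, val=value)
        for value in values
    )
    return (
        "M = LOAD '{path}' AS ({col}:chararray);\n"
        "N = FOREACH M GENERATE {flags};\n"
    ).format(path=input_path, col=column, flags=flags)


# Hypothetical inputs matching the City example.
script = build_encoding_script("input", "city", ["A", "B", "C"])
print(script)
```

The distinct values themselves would come from the first step (the grouped relation), so the script for the second step can only be generated after that step has run.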

Hopefully that helps,

On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
> This should work if the values are only A,B,C.
> M = load 'input' as (city:chararray);
> N = foreach M generate (city == 'A' ? 1 : 0) as A, (city == 'B' ? 1 : 0) as B,
>     (city == 'C' ? 1 : 0) as C;
> However, if city values vary it might be a good option to do it by
> embedding Pig in Java.
> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
> Thanks,
> Prashant
> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath<[EMAIL PROTECTED]>  wrote:
>> Consider this scenario:
>> I have a column named City and it takes 3 possible values: A,B,C
>> City
>> A
>> B
>> C
>> A
>> C
>> C
>> I want to convert it into
>> A  B  C
>> 1  0  0
>> 0  1  0
>> 0  0  1
>> 1  0  0
>> 0  0  1
>> 0  0  1
>> I am trying to write a Pig script that takes two parameters: the data
>> and the column name, in this case 'City'. The script should then
>> identify the distinct values the column takes, create that many
>> columns, and populate each with 1 or 0 depending on which value the
>> row holds.
>> Please let me know if you have got any ideas on how to approach this
>> problem.
>> Thanks,
>> Austin