Re: Pig script to convert Categorical variables
No problem. Returning a variable schema sounds pretty cool and like
something that should be doable, but I'm not really sure how to go about
it. Maybe someone else knows?

Eli

On 2/21/12 1:27 AM, Austin Chungath wrote:
> Thanks Eli,
> That helps and it was exactly what I was doing. I wrote the UDF and it is
> working.
> I wrote a UDF that takes two parameters: the first is a bag of
> tuples containing the distinct values (ordered ascending) and the second
> is the original data set. It is working, but now I am trying
> to figure out how I can return a schema for the columns created with the
> names of the distinct values.
>
> City
> A
> B
> C
> A
> C
> C
>
> I want to convert it into
>
> A             B            C
> 1              0            0
> 0              1            0
> 0              0            1
> 1              0            0
> 0              0            1
> 0              0            1
> How can the UDF return a schema containing the names of the cities? Is it
> possible?
> I should be able to generate A rather than generate $0.
> Thanks,
> Austin
>
> On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
>
>> Interesting problem. What I'm thinking is why not do two steps. First,
>> read in the data, group on the column you care about. Then generate on it
>> so you get just the distinct values for that column left. This would be
>> something like:
>>
>> CITIES_GROUPED = GROUP INITIAL BY city;
>> CITIES = FOREACH CITIES_GROUPED GENERATE group AS city;
>>
>>
>> Once you have that, convert it to a tuple, and then just write a quick udf
>> that goes through the ORIGINAL data set and takes in the row value for the
>> column you care about along with the distinct values tuple you just created
>> as parameters and returns a tuple of 0s and one 1 where the one is in the
>> position in the distinct values tuple that matches the row value for that
>> row for the column you care about. You could write that udf in Java,
>> Python, or one of the other supported udf languages, depending on your
>> requirements.
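
The per-row logic described above can be sketched in plain Python. This is only an illustration of the encoding step, not the UDF that was actually written in this thread: the function name `one_hot` and the choice to return all zeros for a value missing from the distinct list are assumptions, and the wiring needed to register it as a Pig UDF (including its output schema) is not shown.

```python
def one_hot(value, distinct_values):
    """Return a tuple of 0s with a single 1 at the position of `value`.

    `distinct_values` is the ordered tuple of distinct column values.
    Assumption: a value that is not in `distinct_values` yields all 0s.
    """
    return tuple(1 if value == candidate else 0 for candidate in distinct_values)


cities = ("A", "B", "C")                # distinct values, ordered ascending
rows = ["A", "B", "C", "A", "C", "C"]   # the City column from the example

# One encoded tuple per input row, in the same order as the input.
encoded = [one_hot(city, cities) for city in rows]
```

Each element of `encoded` lines up positionally with `cities`, so the column names still have to be supplied separately when the result is used.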
>>
>> For inputting, you could do it either through a simple bash script (your
>> use case is simple enough, I think), or you could go ahead and embed the
>> PIG script in Java, Python, or one of the other languages that's supported
>> for that functionality, so it's easy to expand if you later need to. I'm
>> personally partial to Python and have had great results embedding in that.
>> Just make sure you're on Pig 0.9.1+.
>>
>> Hopefully that helps,
>> Eli
>>
>>
>> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>>
>>> This should work if the values are only A,B,C.
>>>
>>> M = load 'input' as (city:chararray);
>>>
>>> N = foreach M generate (city == 'A' ? 1 : 0) as A, (city == 'B' ? 1 : 0) as B,
>>> (city == 'C' ? 1 : 0) as C;
>>>
>>> However, if city values vary it might be a good option to do it by
>>> embedding Pig in Java.
>>> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
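
When the city values vary, one way to get named columns is to have a host program build the GENERATE statement from the distinct values before handing the script to Pig. The sketch below covers only that string-building step in Python. The helper name `build_generate_statement` and its parameters are hypothetical, and it assumes every distinct value is a valid Pig alias and contains no quote characters. Running the resulting statement through Pig is not shown.

```python
def build_generate_statement(relation, column, distinct_values, out_alias="N"):
    """Build a Pig FOREACH ... GENERATE statement with one 0/1 column per value.

    Assumption: each value is a legal Pig field name and has no quotes in it.
    """
    fields = ", ".join(
        "({col} == '{val}' ? 1 : 0) AS {val}".format(col=column, val=value)
        for value in distinct_values
    )
    return "{out} = FOREACH {rel} GENERATE {fields};".format(
        out=out_alias, rel=relation, fields=fields
    )


# Distinct values would normally come from a first pass over the data.
statement = build_generate_statement("M", "city", ["A", "B", "C"])
print(statement)
```

Because the column names are written into the script text itself, the output relation can be addressed by name (for example `A`) rather than by position (`$0`).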
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath <[EMAIL PROTECTED]> wrote:
>>>
>>>> Consider this scenario:
>>>> I have a column named City and it takes 3 possible values: A,B,C
>>>>
>>>> City
>>>> A
>>>> B
>>>> C
>>>> A
>>>> C
>>>> C
>>>>
>>>> I want to convert it into
>>>>
>>>> A             B            C
>>>> 1              0            0
>>>> 0              1            0
>>>> 0              0            1
>>>> 1              0            0
>>>> 0              0            1
>>>> 0              0            1
>>>>
>>>> I am trying to write a pig script that will take two parameters, one
>>>> parameter is the data and then the column name, in this case 'City'. The
>>>> script should then identify distinct values that it will take and then
>>>> create that many columns and populate it with 1 or 0 depending on which
>>>> one is true.
>>>> Please let me know if you have got any ideas on how to approach this