Re: Pig script to convert Categorical variables
Eli Finkelshteyn 2012-02-21, 16:17
No problem. Returning a variable schema sounds pretty cool and like
something that should be doable, but I'm not really sure how to go about
it. Maybe someone else knows?

Eli

On 2/21/12 1:27 AM, Austin Chungath wrote:
> Thanks Eli,
> That helps and it was exactly what I was doing. I wrote the UDF and it is
> working.
> I wrote a UDF that takes two parameters, first parameter was a bag of
> tuples containing distinct values (ordered ascending) and the second
> parameter is the original data set. It is working but now I am trying
> to figure out how I can return a schema for the columns created with the
> names of the distinct values.
>
> City
> A
> B
> C
> A
> C
> C
>
> I want to convert it into
>
> A B C
> 1 0 0
> 0 1 0
> 0 0 1
> 1 0 0
> 0 0 1
> 0 0 1
>
> how can the UDF return a schema containing the names of the cities? is it
> possible?
> I should be able to generate A rather than generate $0.
> Thanks,
> Austin
>
> On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn<[EMAIL PROTECTED]> wrote:
>
>> Interesting problem. What I'm thinking is why not do two steps. First,
>> read in the data, group on the column you care about. Then generate on it
>> so you get just the distinct values for that column left. This would be
>> something like:
>>
>> CITIES_GROUPED = GROUP INITIAL BY city;
>> CITIES = FOREACH CITIES_GROUPED GENERATE group AS city;
>>
>>
>> Once you have that, convert it to a tuple, and then just write a quick udf
>> that goes through the ORIGINAL data set and takes in the row value for the
>> column you care about along with the distinct values tuple you just created
>> as parameters and returns a tuple of 0s and one 1 where the one is in the
>> position in the distinct values tuple that matches the row value for that
>> row for the column you care about. You could write that udf in Java,
>> Python, or one of the other supported udf languages, depending on your
>> requirements.
>>
>> For inputting, you could do it either through a simple bash script (your
>> use case is simple enough, I think), or you could go ahead and embed the
>> PIG script in Java, Python, or one of the other languages that's supported
>> for that functionality, so it's easy to expand if you later need to. I'm
>> personally partial to Python and have had great results embedding in that.
>> Just make sure you're on Pig 0.9.1+.
>>
>> Hopefully that helps,
>> Eli
>>
>>
>> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>>
>>> This should work if the values are only A,B,C.
>>>
>>> M = load 'input' as (city:chararray);
>>>
>>> N = foreach M generate (city == 'A' ? 1 : 0) as A, (city == 'B' ? 1 : 0) as B,
>>> (city == 'C' ? 1 : 0) as C;
>>>
>>> However, if city values vary it might be a good option to do it by
>>> embedding Pig in Java.
>>> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath<[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Consider this scenario:
>>>> I have a column named City and it takes 3 possible values: A,B,C
>>>>
>>>> City
>>>> A
>>>> B
>>>> C
>>>> A
>>>> C
>>>> C
>>>>
>>>> I want to convert it into
>>>>
>>>> A B C
>>>> 1 0 0
>>>> 0 1 0
>>>> 0 0 1
>>>> 1 0 0
>>>> 0 0 1
>>>> 0 0 1
>>>>
>>>> I am trying to write a pig script that will take two parameters, one
>>>> parameter is the data and then the column name, in this case 'City'. The
>>>> script should then identify distinct values that it will take and then
>>>> create that many columns and populate it with 1 or 0 depending on which
>>>> one is true.
>>>> Please let me know if you have got any ideas on how to approach this
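
The UDF logic discussed in this thread (take a row's value plus the ordered distinct values, and return a tuple of 0s with a single 1) can be sketched in plain Python. This is only a rough illustration under stated assumptions, not the UDF Austin wrote: the names one_hot and schema_string are hypothetical, unknown values are assumed to map to all zeros, and wiring the result into Pig as a UDF with a dynamic output schema is left out because the thread does not settle how to do that.

```python
def one_hot(value, distinct_values):
    """Return a tuple of 0s with a single 1 at the position of `value`.

    `distinct_values` is the ordered sequence of distinct column values,
    e.g. ('A', 'B', 'C'). A value that is not in the sequence yields all
    zeros (an assumption; the thread does not cover unknown values).
    """
    return tuple(1 if value == d else 0 for d in distinct_values)


def schema_string(distinct_values, pig_type="int"):
    """Build a field list such as 'A:int,B:int,C:int'.

    Hypothetical helper: it only formats the column names that a schema
    would need; it does not register anything with Pig.
    """
    return ",".join("%s:%s" % (d, pig_type) for d in distinct_values)


# The City column from the thread.
cities = ["A", "B", "C", "A", "C", "C"]

# Distinct values, ordered ascending as in Austin's description.
distinct = tuple(sorted(set(cities)))

# One output row per input row.
encoded = [one_hot(city, distinct) for city in cities]
```

Here encoded holds one tuple per input row, in the same column order as distinct, and schema_string(distinct) gives the matching column names, which is the piece Austin is asking how to hand back to Pig.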