|
|
-
Replace null with string
Mario Lassnig 2012-06-07, 11:37
Hello,
I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4)
Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null.
Example:
tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name;
Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead.
Thanks a lot, Mario
-
Re: Replace null with string
Dmitriy Ryaboy 2012-06-07, 13:20
Nonulls = foreach somenulls generate (field == null ? 'other' : field) as field;
On Jun 7, 2012, at 4:37 AM, Mario Lassnig <[EMAIL PROTECTED]> wrote:
> Hello, > > I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4) > > Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null. > > Example: > > tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name; > > Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead. > > Thanks a lot, > Mario
-
Re: Replace null with string
Dragan Nedeljkovic 2012-06-08, 22:00
You can use an UDF like the one bellow to deal with the NULLs.
register 'mypiggybank.jar'; define Nvl piggybank.Nvl();
input_lines = LOAD 'test_Nvl.in' AS (line:chararray); describe input_lines; dump input_lines;
new_list = FOREACH input_lines GENERATE Nvl(line, 'n/a'); describe new_list; dump new_list;
-- eof
package piggybank;
import java.io.IOException;
import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple;
public class Nvl extends EvalFunc<String> { public String exec(Tuple inputA) throws IOException { try { String inputVal = (String)inputA.get(0); String defaultVal = (String)inputA.get(1); if(inputVal == null || inputVal.length() == 0) { return defaultVal; }
return inputVal; } catch(Exception e) { // Throwing an exception will cause the task to fail. throw new IOException("Something bad happened!", e); } } } // eof >________________________________ > From: Mario Lassnig <[EMAIL PROTECTED]> >To: [EMAIL PROTECTED] >Sent: Thursday, June 7, 2012 7:37:36 AM >Subject: Replace null with string > >Hello, > >I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4) > >Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null. > >Example: > >tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name; > >Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead. > >Thanks a lot, >Mario > > >
-
Re: Replace null with string
Russell Jurney 2012-06-08, 23:30
fixed = FOREACH my_relation GENERATE (my_field IS NOT NULL ? my_field : 'other') as my_field, *;
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
On Jun 8, 2012, at 3:01 PM, Dragan Nedeljkovic <[EMAIL PROTECTED]> wrote:
> You can use an UDF like the one bellow to deal with the NULLs. > > register 'mypiggybank.jar'; > define Nvl piggybank.Nvl(); > > input_lines = LOAD 'test_Nvl.in' AS (line:chararray); > describe input_lines; > dump input_lines; > > new_list = FOREACH input_lines GENERATE Nvl(line, 'n/a'); > describe new_list; > dump new_list; > > -- eof > > package piggybank; > > import java.io.IOException; > > import org.apache.pig.EvalFunc; > import org.apache.pig.data.Tuple; > > public class Nvl > extends EvalFunc<String> > { > public String exec(Tuple inputA) > throws IOException > { > try > { > String inputVal = (String)inputA.get(0); > String defaultVal = (String)inputA.get(1); > > if(inputVal == null || inputVal.length() == 0) > { > return defaultVal; > } > > return inputVal; > } > catch(Exception e) > { > // Throwing an exception will cause the task to fail. > throw new IOException("Something bad happened!", e); > } > } > } > // eof > > > > >> ________________________________ >> From: Mario Lassnig <[EMAIL PROTECTED]> >> To: [EMAIL PROTECTED] >> Sent: Thursday, June 7, 2012 7:37:36 AM >> Subject: Replace null with string >> >> Hello, >> >> I'm having a lot of null entries in my data. Due to later processing it would be very helpful if I could set a default value for null to be the string "other". I couldn't find a way to do this (version 0.8.1-cdh3u4) >> >> Also, I have some variables in my GENERATE statements that can potentially return null, and I would need something similar to the SQL DECODE function to get the "other" string instead of null. >> >> Example: >> >> tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length, rnbfiles, rlength, name; >> >> Here: 'site' and 'datatype' could return an empty string (which is valid) and is interpreted as null, but should be "other" instead. >> >> Thanks a lot, >> Mario >> >>
|
|