filter out null lines returned by UDF
Dexin Wang 2012-03-02, 00:45
Hi,
I have a UDF that parses a line and returns a bag. Sometimes the line is bad, so the UDF returns null. In my Pig script, I'd like to filter out those nulls like this:
raw = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in the tuple: id and name
DUMP parsed;

(id1,name1)
(id2,name2)
()
(id3,name3)

parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
DUMP parsed_no_nulls;

(id1,name1)
(id2,name2)
(id3,name3)
This works, but I'm getting this warning:
WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input
When I try to filter with IsEmpty instead, I get the error "Cannot test a NULL for emptiness".
What's the correct way to filter out these null bags returned from my UDF?
Thanks.
Dexin
Re: filter out null lines returned by UDF
Jonathan Coveney 2012-03-02, 01:14
FLATTEN is kind of quirky. If you FLATTEN(null), it returns null, but if you FLATTEN an empty bag (i.e. size 0), it throws away the row. I would have your UDF return an empty bag and let FLATTEN wipe it out.
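To make the difference concrete, here is a rough sketch in plain Python that only models the behavior described above. It does not use the Pig API. The names parse_line and flatten are hypothetical stand-ins for MyUDF and FLATTEN, and the comma-separated "id,name" input format is an assumption, since the thread does not show the real one.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, ...]


def parse_line(line: str) -> List[Row]:
    """Hypothetical stand-in for MyUDF: returns a bag (here, a list) of (id, name) tuples.

    Assumes an 'id,name' input format. A bad line yields an empty bag, not None.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not all(parts):
        return []  # empty bag: the row disappears once flattened
    return [(parts[0], parts[1])]


def flatten(bag: Optional[List[Row]]) -> List[Row]:
    """Model of the FLATTEN behavior described in this thread (not Pig's implementation)."""
    if bag is None:
        return [()]  # a null bag survives as one empty row, like the () in the DUMP output
    return list(bag)  # an empty bag contributes no rows at all


lines = ["id1,name1", "id2,name2", "garbage", "id3,name3"]
parsed = [row for line in lines for row in flatten(parse_line(line))]
print(parsed)
```

With the empty-bag return, the bad line produces no output row, so no FILTER step is needed afterwards. If parse_line returned None instead, flatten would emit the stray () row seen in the original DUMP.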
Re: filter out null lines returned by UDF
Dexin Wang 2012-03-07, 23:08
Yeah, that works great. Thank you, Jonathan.