Pig >> mail # user >> filter out null lines returned by UDF


filter out null lines returned by UDF
Hi,

I have a UDF that parses a line and returns a bag. Sometimes the line is
bad, so the UDF returns null. In my Pig script, I'd like to filter out
those nulls like this:

raw = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));    -- get two fields in the tuple: id and name
DUMP parsed;

   (id1,name1)
   (id2,name2)
   ()
   (id3,name3)

parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
DUMP parsed_no_nulls;

   (id1,name1)
   (id2,name2)
   (id3,name3)

This works, but I'm getting this warning:

 WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input

When I try to filter with IsEmpty instead, I get this error: "Cannot test a
NULL for emptiness".

What's the correct way to filter out these null bags returned from my UDF?

Thanks.
Dexin
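
One approach that might avoid both the warning and the IsEmpty error is to
filter on the bag itself before flattening it, so that FLATTEN never sees a
null. The following is only a sketch, assuming MyUDF returns a null bag (not
a bag holding an empty tuple) for bad lines; the aliases bags and good_bags
and the field name parsed_bag are made up for illustration:

```
raw = LOAD 'raw_input' AS (line:chararray);

-- keep the UDF result as a single bag field instead of flattening right away
bags = FOREACH raw GENERATE MyUDF(line) AS parsed_bag;

-- drop the rows where the UDF returned null (hypothetical alias good_bags)
good_bags = FILTER bags BY parsed_bag IS NOT NULL;

-- flatten only the non-null bags
parsed = FOREACH good_bags GENERATE FLATTEN(parsed_bag);
DUMP parsed;
```

Another option, if changing the UDF is acceptable, would be to return an
empty bag instead of null for bad lines, since FLATTEN of an empty bag
produces no output row at all.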