Pig >> mail # user >> filter out null lines returned by UDF


filter out null lines returned by UDF
Hi,

I have a UDF that parses a line and returns a bag. Sometimes the line is
bad, so the UDF returns null. In my Pig script, I'd like to filter out
those nulls like this:

raw = LOAD 'raw_input' AS (line:chararray);
-- get two fields in the tuple: id and name
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));
DUMP parsed;

   (id1,name1)
   (id2,name2)
   ()
   (id3,name3)

parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
DUMP parsed_no_nulls;

   (id1,name1)
   (id2,name2)
   (id3,name3)

This works, but I'm getting this warning:

 WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input

When I try to filter with IsEmpty instead, I get this error: "Cannot test a
NULL for emptiness".

What's the correct way to filter out these null bags returned from my UDF?

Thanks.
Dexin
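
One approach that may be worth trying is to filter on the bag itself before flattening it, so that FLATTEN never sees a null. This is only a sketch and has not been checked against this setup: the aliases bags and good and the field name parsed_bag are made up for illustration, and whether the POProject warning goes away may depend on the Pig version.

```pig
raw = LOAD 'raw_input' AS (line:chararray);

-- Keep the UDF result as a bag first, without flattening it.
-- (parsed_bag is a hypothetical field name.)
bags = FOREACH raw GENERATE MyUDF(line) AS parsed_bag;

-- Drop the rows where the UDF returned null for a bad line.
good = FILTER bags BY parsed_bag IS NOT NULL;

-- Flatten only the surviving bags into (id, name) tuples.
parsed = FOREACH good GENERATE FLATTEN(parsed_bag);
DUMP parsed;
```

An alternative, if the UDF can be changed, is to return an empty bag instead of null for bad lines. FLATTEN of an empty bag produces no output rows, so no separate FILTER step would be needed.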
Jonathan Coveney 2012-03-02, 01:14
Dexin Wang 2012-03-07, 23:08