Pig >> mail # user >> filter out null lines returned by UDF


Dexin Wang 2012-03-02, 00:45
Jonathan Coveney 2012-03-02, 01:14
Re: filter out null lines returned by UDF
Yeah, that works great. Thank you, Jonathan.

On Thu, Mar 1, 2012 at 5:14 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> FLATTEN is kind of quirky. If you FLATTEN(null), it will return null, but
> if you FLATTEN a bag that is empty (i.e., size=0), it will throw away the row.
> I would have your UDF return an empty bag and let the flatten wipe it out.
>
> 2012/3/1 Dexin Wang <[EMAIL PROTECTED]>
>
> > Hi,
> >
> > I have a UDF that parses a line and then returns a bag, and sometimes the
> > line is bad so I'm returning null in the UDF. In my pig script, I'd like to
> > filter those nulls like this:
> >
> > raw = LOAD 'raw_input' AS (line:chararray);
> > -- get two fields in the tuple: id and name
> > parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));
> > DUMP parsed;
> >
> >   (id1,name1)
> >   (id2,name2)
> >   ()
> >   (id3,name3)
> >
> > parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
> > DUMP parsed_no_nulls;
> >
> >   (id1,name1)
> >   (id2,name2)
> >   (id3,name3)
> >
> > This works, but I'm getting this warning:
> >
> >  WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger -
> >  org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
> >  Attempt to access field which was not found in the input
> >
> > When I try to use IsEmpty to filter, I get this error "Cannot test a NULL
> > for emptiness".
> >
> > What's the correct way to filter out these null bags returned from my UDF?
> >
> > Thanks.
> > Dexin
> >
>
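
For reference, here is a rough Pig Latin sketch of how the script might look after following Jonathan's suggestion. It assumes MyUDF has been changed to return an empty bag (not null) for bad lines. The second half is an alternative that was not discussed in the thread, for cases where the UDF cannot be changed. The aliases bags and good and the AS (id, name) clause are illustrative names only.

```pig
-- Option 1 (Jonathan's suggestion): MyUDF returns an EMPTY bag for bad lines.
-- FLATTEN of an empty bag produces no output row, so no FILTER is needed.
raw    = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));
DUMP parsed;

-- Option 2 (assumption, not from the thread): MyUDF still returns null.
-- Drop the null bags BEFORE flattening, so FLATTEN never sees a null.
bags   = FOREACH raw GENERATE MyUDF(line) AS b;
good   = FILTER bags BY b IS NOT NULL;
parsed_no_nulls = FOREACH good GENERATE FLATTEN(b) AS (id, name);
DUMP parsed_no_nulls;
```

With option 1 the script stays a single FOREACH. Option 2 costs an extra FILTER step, but filtering on the bag itself (rather than on id after the FLATTEN) should avoid projecting a field out of an empty tuple, which is what appears to trigger the POProject warning quoted above.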