Pig >> mail # user >> removing lines with missing values


Re: removing lines with missing values
Hi Jamal,

If any fields are empty in an input file, they will be loaded as nulls by
Pig. For example,

,f2,f3,,f5

will be loaded as

(null,f2,f3,null,f5) // when you dump it, it will be printed as
(,f2,f3,,f5).

Now you can use COUNT to count the number of non-null elements in a bag
(COUNT skips nulls) and keep only the rows where that count equals the
total number of columns. Rows that contain a null give a smaller count and
are filtered out. Here is an example:

a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
b = FOREACH a GENERATE *, TOBAG(*) AS aBag;
c = FILTER b BY COUNT(aBag) == 3;
d = FOREACH c GENERATE i,j,k;
DUMP d;

With the following input file:

1,2,3
,2,3
1,,3

This gives me:

(1,2,3)

Alternatively, you can easily write a UDF that takes a tuple or a bag and
checks whether any element is null.
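
As a rough illustration of the UDF idea, here is a minimal sketch in Python,
which Pig can run through Jython. The file name null_udfs.py, the function
name has_null, and the int return value are all hypothetical choices made for
this example. It also assumes that Pig passes a tuple argument as a Python
sequence with nulls mapped to None, and that Pig supplies the outputSchema
decorator when the file is registered. The Pig statements in the comments are
untested.

```python
# null_udfs.py (hypothetical file name): a sketch of a null-checking UDF.
#
# Possible use from Pig (an untested assumption, shown for orientation only):
#   REGISTER 'null_udfs.py' USING jython AS nulls;
#   c = FILTER a BY nulls.has_null(TOTUPLE(i,j,k)) == 0;

# Assumption: Pig defines outputSchema when it loads the file through Jython.
# The fallback below only exists so the file also runs as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("has_null:int")
def has_null(fields):
    """Return 1 if the tuple is null or any element is null (None), else 0."""
    if fields is None:
        return 1
    for value in fields:
        if value is None:
            return 1
    return 0


if __name__ == "__main__":
    # Same three rows as the sample input file, with empty fields as None.
    rows = [(1, 2, 3), (None, 2, 3), (1, None, 3)]
    kept = [row for row in rows if has_null(row) == 0]
    print(kept)  # prints [(1, 2, 3)]
```

Unlike the COUNT approach, a UDF like this does not need the number of
columns hard-coded in the filter condition.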

Thanks,
Cheolsoo

On Wed, Nov 14, 2012 at 10:03 AM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi,
> I have a dataset of the form
>
> F1, f2......fn
>
> Sometimes f1 is empty, sometimes f2, and so on.
> Basically, whenever any field is empty, I want to ignore that entry.
> One way to do this is to filter with f1 != '' and so on,
> but that would be an ugly statement.
> Is there a better way to do this?
>