Pig, mail # user - removing lines with missing values


jamal sasha 2012-11-14, 18:03
Re: removing lines with missing values
Cheolsoo Park 2012-11-15, 18:01
Hi Jamal,

If any fields are empty in an input file, they will be loaded as nulls by
Pig. For example,

,f2,f3,,f5

will be loaded as

(null,f2,f3,null,f5)

When you dump it, it will be printed as (,f2,f3,,f5).

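Now you can use COUNT to count the number of non-null elements in a bag
and keep only the rows where that count equals the total number of columns,
which filters out the rows that have nulls. This works because TOBAG wraps
each field in its own tuple, and COUNT skips tuples whose first field is
null. Here is an example: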

a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
b = FOREACH a GENERATE *, TOBAG(*) AS aBag;
c = FILTER b BY COUNT(aBag) == 3;
d = FOREACH c GENERATE i,j,k;
DUMP d;

With the following input file:

1,2,3
,2,3
1,,3

This gives me:

(1,2,3)

Alternatively, you can easily write a UDF that takes a tuple or a bag and
checks whether any element is null.
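
As a rough illustration of the UDF approach, here is a minimal sketch written as a Python (Jython) UDF rather than a Java one. The file name null_udfs.py, the function name null_count, and the alias nulls are hypothetical, and the Pig Latin usage shown in the comments is an untested assumption, not something from this thread. The fallback decorator is there only so the file can also be run outside Pig.

```python
# Sketch of a Pig UDF in Python (Jython). Hypothetical file name: null_udfs.py.
# Assumption: Pig supplies the outputSchema decorator when the script is
# registered with "USING jython"; the fallback below only lets the file run
# on its own for testing.
if "outputSchema" not in globals():
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("nulls:int")
def null_count(row):
    """Return how many fields of the row are null (None in Python)."""
    return sum(1 for field in row if field is None)


# Assumed, untested usage from Pig Latin:
#   REGISTER 'null_udfs.py' USING jython AS nulls;
#   c = FILTER a BY nulls.null_count(TOTUPLE(i, j, k)) == 0;

if __name__ == "__main__":
    # Same rows as the sample input above, with empty fields as None.
    rows = [(1, 2, 3), (None, 2, 3), (1, None, 3)]
    print([r for r in rows if null_count(r) == 0])
```

Returning a count instead of a boolean keeps the filter condition a plain integer comparison, and the same function can be reused to keep rows with at most a given number of missing fields.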

Thanks,
Cheolsoo

On Wed, Nov 14, 2012 at 10:03 AM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi
> I have a dataset in the following form
>
> F1, f2......fn
>
> Sometimes f1 is empty, sometimes f2, and so on.
> Basically, whenever any field is empty, I want to ignore that entry.
> One way to do this is a filter such as f1 != '' and so on,
> but that would be an ugly statement.
> Is there a better way to do this?
>