Pig >> mail # user >> removing lines with missing values


jamal sasha 2012-11-14, 18:03
Re: removing lines with missing values
Hi Jamal,

If any fields are empty in an input file, they will be loaded as nulls by
Pig. For example,

,f2,f3,,f5

will be loaded as

(null,f2,f3,null,f5) // when you dump it, it will be printed as
(,f2,f3,,f5).

Now you can use COUNT to count the number of non-null elements in a bag
and keep only the rows where that count equals the total number of columns.
Any row that contains a null fails this test. Here is an example:

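-- Load three comma-separated fields; empty fields are loaded as nulls.
a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
-- Put the fields of each row into a bag.
b = FOREACH a GENERATE *, TOBAG(*) AS aBag;
-- COUNT skips nulls, so it equals 3 only when no field is null.
c = FILTER b BY COUNT(aBag) == 3;
-- Drop the helper bag and keep the original columns.
d = FOREACH c GENERATE i,j,k;
DUMP d;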

With the following input file:

1,2,3
,2,3
1,,3

This gives me:

(1,2,3)

Alternatively, you can easily write a UDF that takes a tuple or a bag and
checks whether any element is null.
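
As a rough illustration of the UDF route, here is a minimal sketch of a Python (Jython) UDF that counts the nulls in a tuple. The function name count_nulls and the file name null_check.py are made up for this example, and the sketch assumes that Pig hands nulls to the Python side as None and supplies the outputSchema decorator when the script is registered with USING jython. The fallback definition is only there so the file also runs outside Pig.

```python
# Pig is assumed to inject `outputSchema` when this script is registered
# with `USING jython`. Define a no-op stand-in so the file also runs as
# plain Python (for example, in a quick local test).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("nulls:int")
def count_nulls(fields):
    """Return how many elements of the tuple `fields` are null (None)."""
    nulls = 0
    for value in fields:
        if value is None:
            nulls += 1
    return nulls
```

On the Pig side, something along these lines should then keep only the complete rows (untested, so treat it as a starting point):

REGISTER 'null_check.py' USING jython AS nullcheck;
c = FILTER a BY nullcheck.count_nulls(TOTUPLE(i,j,k)) == 0;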

Thanks,
Cheolsoo

On Wed, Nov 14, 2012 at 10:03 AM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi
> I have a dataset in this form:
>
> f1, f2, ..., fn
>
> Sometimes f1 is empty, sometimes f2, and so on.
> Basically, what I want is to ignore an entry whenever any field is empty.
> One way to do this is a filter such as f1 != '' and so on,
> but that would be an ugly statement.
> Is there a better way to do this?
>