Re: removing lines with missing values
Cheolsoo Park 2012-11-15, 18:01
If any fields are empty in an input file, they will be loaded as nulls by
Pig. For example, the comma-delimited line
    ,f2,f3,,f5
will be loaded as
    (null,f2,f3,null,f5)
which DUMP prints as (,f2,f3,,f5), since nulls are shown as empty fields.
COUNT only counts the non-null elements of a bag, so you can put all the
fields of a row into a bag and compare the count with the number of columns:
a row contains a null exactly when the count is less than the total number
of columns. Here is an example:
a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
b = FOREACH a GENERATE *, TOBAG(*) AS aBag;  -- one bag element per field
c = FILTER b BY COUNT(aBag) == 3;            -- keep rows with 3 non-null fields
d = FOREACH c GENERATE i,j,k;                -- drop the helper bag
With the following input file:
This gives me:
Alternatively, you can easily write a UDF that takes a tuple or a bag and
checks whether any element is null.
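
For the UDF route, here is a minimal sketch written as a Python (Jython) UDF. The file name null_udfs.py, the function name has_null, and the choice to return an int flag are all illustrative assumptions rather than anything from this thread, and I have not run it against a cluster. It assumes Pig hands the tuple to Python as a sequence with nulls mapped to None.

```python
# Hypothetical file name: null_udfs.py
# Pig's Jython engine provides the outputSchema decorator when the script is
# registered. The fallback below exists only so the file also runs in plain
# Python (for example, for a quick local check).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("has_null:int")
def has_null(row):
    """Return 1 if the tuple is null or any of its fields is null, else 0."""
    if row is None:
        return 1
    for field in row:
        if field is None:
            return 1
    return 0
```

Assuming the file is saved as null_udfs.py next to the script, it could be used roughly like this:

    REGISTER 'null_udfs.py' USING jython AS nulls;
    a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
    b = FILTER a BY nulls.has_null(TOTUPLE(i,j,k)) == 0;

Unlike the COUNT approach, this does not require hard-coding the number of columns in the filter condition.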
On Wed, Nov 14, 2012 at 10:03 AM, jamal sasha <[EMAIL PROTECTED]> wrote:
> I have a dataset of the form
> f1, f2, ..., fn
> Sometimes f1 is empty, sometimes f2, and so on.
> Basically, whenever any field is empty I want to ignore that entry.
> One way to do this is to filter with f1 != '' and so on,
> but that would be an ugly statement.
> Is there a better way to do this?