Re: removing lines with missing values
Cheolsoo Park 2012-11-15, 18:01
Hi Jamal,
If any fields are empty in an input file, they will be loaded as nulls by Pig. For example, ,f2,f3,,f5 will be loaded as (null,f2,f3,null,f5); when you dump it, it will be printed as (,f2,f3,,f5).

Now you can use COUNT to count the number of non-null elements in a bag, and keep only the rows whose count equals the total number of columns. In other words, rows where "COUNT(*) != total # of columns" contain nulls and are filtered out. Here is an example:

    a = LOAD '1.txt' USING PigStorage(',') AS (i,j,k);
    b = FOREACH a GENERATE *, TOBAG(*) AS aBag;
    c = FILTER b BY COUNT(aBag) == 3;
    d = FOREACH c GENERATE i,j,k;
    DUMP d;

With the following input file:

    1,2,3
    ,2,3
    1,,3

This gives me:

    (1,2,3)

Alternatively, you can easily write a UDF that takes a tuple or a bag and checks whether any element is null.

Thanks,
Cheolsoo

On Wed, Nov 14, 2012 at 10:03 AM, jamal sasha <[EMAIL PROTECTED]> wrote:
> Hi
> I have dataset in some form
>
> F1, f2......fn
>
> Now sometimes f1 is empty sometimes f2 and so on
> Basically what I want is anytime any field is empty ignore that entry.
> Now one way to do is using filter f1!='' and so on.
> But that would be an ugly statement.
> Is there a better way to do this
>
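
P.S. on the UDF alternative mentioned in the reply: the following is only a minimal sketch of what such a null check could look like as a Python (Jython) UDF. The function name has_null and the schema string are hypothetical and not from the original post. It assumes that Pig makes the outputSchema decorator available when the script is registered with "USING jython", and that Pig nulls arrive in the UDF as None. A no-op fallback decorator is defined so the file can also be run on its own.

```python
# Sketch of a null-checking UDF. All names here are illustrative assumptions.

try:
    # Assumption: Pig defines this decorator when the script is registered
    # with "USING jython". Outside Pig the name does not exist.
    outputSchema
except NameError:
    def outputSchema(schema):
        """Fallback no-op decorator so the file also runs outside Pig."""
        def wrap(func):
            return func
        return wrap


@outputSchema("has_null:boolean")
def has_null(row):
    """Return True if the row is missing or any of its fields is None."""
    if row is None:
        return True
    return any(field is None for field in row)
```

Applied to the three sample rows from the example, (1,2,3), (null,2,3) and (1,null,3), has_null is false only for the first one, which matches the row that the COUNT-based filter keeps.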