Pig >> mail # user >> FLATTEN eats null rows?


Re: FLATTEN eats null rows?

On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:

> Seems after FLATTEN, the rows with null values get dropped.
>
What you are seeing is the expected, documented behavior of FLATTEN:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it's 'empty bag', not 'null'.)
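
To illustrate the difference, here is a minimal sketch. The file names a.txt and b.txt and the COGROUP step are assumptions on my part, since the original script is not quoted in full; the field names follow the DESCRIBE output quoted further down.

```pig
-- Hypothetical inputs (tab-separated):
--   a.txt holds (id, f1, f2), b.txt holds (id, f3)
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3);

-- COGROUP yields one row per id, with one bag per input relation.
-- An id that appears in only one input gets an EMPTY bag (not null)
-- for the other relation.
C = COGROUP A BY id, B BY id;

-- Flattening an empty bag discards the whole row, so any id that is
-- missing from either A or B does not appear in D.
D = FOREACH C GENERATE group, FLATTEN(A.(f1,f2)), FLATTEN(B.f3);
DUMP D;
```

With inputs like yours, only the ids present in both files should survive the two FLATTENs.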
>
> You see if I do FLATTEN, all the rows with null values are all missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows but not flattened,
> obviously. What I want as the end result is:
>
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
>
> How can I get that? Thanks.
>

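One way to keep those rows is to replace an empty bag with null before
flattening, since only an empty bag (not a null) causes the row to be dropped:

D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))),
FLATTEN((IsEmpty(B) ? null : B.f3));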
 

> I realize I could do FULL JOIN, but the problem is that after join, I
> wouldn't know which id is null, I would have to do many if then in the
> following generate command and I hope I can avoid that. E.g.,
>
> C = JOIN A BY id FULL, B BY id;
> DUMP C
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)
>

> DESCRIBE C;
> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
>
> Sometimes A::id is null, sometimes B::id null, I always only want the
> non-null id in my output.
>

You can do this with the conditional expression ( ? : ), which the Pig
documentation calls bincond.

E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
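
For context, a self-contained version of the join approach might look like the sketch below. The file names and the AS aliases on the output fields are assumptions added for illustration; the JOIN and the bincond come from the thread.

```pig
-- Hypothetical inputs (tab-separated):
--   a.txt holds (id, f1, f2), b.txt holds (id, f3)
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3);

-- A full outer join keeps unmatched rows from both sides and fills
-- the missing side with nulls.
C = JOIN A BY id FULL, B BY id;

-- The bincond picks whichever id is non-null. Field names are
-- case sensitive, so they must match the schema (f1, not F1).
E = FOREACH C GENERATE
        (A::id IS NULL ? B::id : A::id) AS id,
        A::f1 AS f1,
        A::f2 AS f2,
        B::f3 AS f3;
DUMP E;
```

Only the id needs a bincond here; the other fields can stay null when their side of the join has no match.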

-Thejas