Pig >> mail # user >> FLATTEN eats null rows?

Dexin Wang 2010-12-31, 00:35
Re: FLATTEN eats null rows?

On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:

> Seems after FLATTEN, the rows with null values get dropped.
What you are seeing is the expected, documented behavior of FLATTEN:
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it says 'empty bag', not 'null'.)
> You see, if I do FLATTEN, all the rows with null values are missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows, but not flattened,
> obviously. What I want as the end result is:
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
> How can I get that? Thanks.

Replace each empty bag with null before flattening, so the row is not
discarded:

D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))),
    FLATTEN((IsEmpty(B) ? null : B.f3));
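
For context, the whole pipeline might look like the sketch below. The file
names, the LOAD schemas, and the COGROUP statement that defines C are
assumptions, since the original script is not quoted in this message; only
the FOREACH statement comes from the reply above.

```pig
-- Hypothetical inputs: tab-separated files a.txt (id, f1, f2) and b.txt (id, f3).
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3);

-- Assumed definition of C: COGROUP keeps every id found in either input;
-- an id missing from one input gets an empty bag for that input.
C = COGROUP A BY id, B BY id;

-- Swap each empty bag for null before FLATTEN so the row is kept.
D = FOREACH C GENERATE group,
        FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
        FLATTEN((IsEmpty(B) ? null : B.f3));

DESCRIBE D;
DUMP D;
```

It is worth running DESCRIBE D and checking a small sample to confirm that
the null side produces the columns you expect.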

> I realize I could do a FULL JOIN, but the problem is that after the join I
> wouldn't know which id is null. I would need many if-then checks in the
> following GENERATE statement, and I hope I can avoid that. E.g.,
> C = JOIN A BY id FULL, B BY id;
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)

> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
> Sometimes A::id is null, sometimes B::id is null. I always want only the
> non-null id in my output.

You can get this by using the conditional expression ( ? : ), which the Pig
documentation calls bincond.

E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
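
As a fuller sketch of the join variant, the script below puts the bincond in
context. The file names and LOAD schemas are assumptions made for
illustration; the JOIN statement and the field names (lowercase, as in the
schema of C quoted above) come from the thread. The AS id alias is an
addition so the merged column has a name.

```pig
-- Hypothetical inputs: tab-separated files a.txt (id, f1, f2) and b.txt (id, f3).
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3);

-- Full outer join: rows found in only one input get nulls for the other side.
C = JOIN A BY id FULL, B BY id;

-- Take whichever id is non-null and name the result id.
E = FOREACH C GENERATE (A::id IS NULL ? B::id : A::id) AS id,
        A::f1, A::f2, B::f3;

DUMP E;
```

Only the id column needs the bincond here; f1, f2 and f3 simply stay null
for ids missing from their input.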

Dexin Wang 2010-12-31, 04:02