Re: FLATTEN eats null rows?
Thanks. Both worked fine.

I think I'll write a MyFlatten that doesn't drop rows with an empty bag. Say you
want to COGROUP 3 or more bags: you would have to do many COGROUPs or JOINs, then
apply IsEmpty or a bincond every time. Instead, with MyFlatten, I would do:

X = COGROUP A BY id, B BY id, C BY id, D BY id;
Y = FOREACH X GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.(f3,f4,f5)),
FLATTEN(C.f6), FLATTEN(D.f7);

The code would be a lot more concise and cleaner.
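
For comparison, here is a rough sketch of the same statement written with the
IsEmpty bincond suggested in the reply quoted below, repeated for each bag. It
is untested, and the relations A through D and fields f1 through f7 are just the
placeholder names from the example above:

```
X = COGROUP A BY id, B BY id, C BY id, D BY id;
-- Each FLATTEN needs its own guard: an empty bag is swapped for null
-- so that FLATTEN does not discard the whole row.
Y = FOREACH X GENERATE group,
    FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
    FLATTEN((IsEmpty(B) ? null : B.(f3, f4, f5))),
    FLATTEN((IsEmpty(C) ? null : C.f6)),
    FLATTEN((IsEmpty(D) ? null : D.f7));
```

Every additional bag adds another guard, which is the repetition MyFlatten
would hide.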

On Thu, Dec 30, 2010 at 6:46 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

>
>
>
> On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:
>
> > Seems after FLATTEN, the rows with null values get dropped.
> >
> What you are seeing is the expected/documented behavior of flatten -
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
> "Note that the flatten of empty bag will result in that row being
> discarded"
> (Note that it's 'empty bag', not 'null'.)
>
>
> >
> > You see, if I do FLATTEN, the rows with null values are all missing (in
> > D). If I don't do FLATTEN, as in E, I have all the rows, but not flattened,
> > obviously. What I want as the end result is:
> >
> > (1,a,b,x)
> > (2,c,d,y)
> > (3,e,f,{})
> > (6,,,z)
> > (8,,,w)
> >
> > How can I get that? Thanks.
> >
>
>  D = FOREACH C GENERATE group,  FLATTEN((IsEmpty(A) ? null : A.(f1,f2) )),
> FLATTEN((IsEmpty(B) ? null : B.f3));
>
>
> > I realize I could do a FULL JOIN, but the problem is that after the join I
> > wouldn't know which id is null. I would have to do many if-then checks in
> > the following GENERATE statement, and I hope I can avoid that. E.g.,
> >
> > C = JOIN A BY id FULL, B BY id;
> > DUMP C;
> > (1,a,b,1,x)
> > (2,c,d,2,y)
> > (3,e,f,,)
> > (,,,6,z)
> > (,,,8,w)
> >
>
>
>
> > DESCRIBE C;
> > C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> > bytearray,B::f3: bytearray}
> >
> > Sometimes A::id is null, sometimes B::id is null; I always want only the
> > non-null id in my output.
> >
>
> You can get this by using the conditional expression (? :), which the Pig
> documentation calls bincond.
>
> E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2,
> B::f3;
>
> -Thejas
>
>
>
>