FLATTEN eats null rows?
Dexin Wang 2010-12-31, 00:35
Seems after FLATTEN, the rows with null values get dropped.
I have two test files:
% cat test1.txt
1 a b
2 c d
3 e f

% cat test2.txt
1 x
2 y
6 z
8 w
I'm trying to cogroup the two on the first column:
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;
DUMP C;

(1,{(1,a,b)},{(1,x)})
(2,{(2,c,d)},{(2,y)})
(3,{(3,e,f)},{})
(6,{},{(6,z)})
(8,{},{(8,w)})

D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;

(1,a,b,x)
(2,c,d,y)

E = FOREACH C GENERATE group, A.(f1, f2), B.f3;
DUMP E;

(1,{(a,b)},{(x)})
(2,{(c,d)},{(y)})
(3,{(e,f)},{})
(6,{},{(z)})
(8,{},{(w)})
As you can see, with FLATTEN the rows with null values are all missing (in D). Without FLATTEN, as in E, I get all the rows, but obviously not flattened. What I want as the end result is:

(1,a,b,x)
(2,c,d,y)
(3,e,f,{})
(6,,,z)
(8,,,w)
How can I get that? Thanks.
Dexin
P.S.
I realize I could do a FULL JOIN, but the problem is that after the join I wouldn't know which id is null. I would need many if-then checks in the following GENERATE statement, and I hope I can avoid that. E.g.,

C = JOIN A BY id FULL, B BY id;
DUMP C;

(1,a,b,1,x)
(2,c,d,2,y)
(3,e,f,,)
(,,,6,z)
(,,,8,w)

DESCRIBE C;
C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id: bytearray,B::f3: bytearray}

Sometimes A::id is null and sometimes B::id is null; I only ever want the non-null id in my output.
Re: FLATTEN eats null rows?
Thejas M Nair 2010-12-31, 02:46
On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:

> Seems after FLATTEN, the rows with null values get dropped.

What you are seeing is the expected, documented behavior of FLATTEN:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it's 'empty bag', not 'null'.)

> You see if I do FLATTEN, all the rows with null values are all missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows but not flattened,
> obviously. What I want as the end result is:
>
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
>
> How can I get that? Thanks.

D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))), FLATTEN((IsEmpty(B) ? null : B.f3));

> I realize I could do FULL JOIN, but the problem is that after join, I
> wouldn't know which id is null, I would have to do many if then in the
> following generate command and I hope I can avoid that. E.g.,
>
> C = JOIN A BY id FULL, B BY id;
> DUMP C
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)
>
> DESCRIBE C;
> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
>
> Sometimes A::id is null, sometimes B::id null, I always only want the
> non-null id in my output.

You can get this by using the conditional expression (called bincond in the Pig documentation) (? : ).

E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;

-Thejas
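
To see why an empty bag removes the whole row, it may help to model FLATTEN outside Pig. The Python sketch below is only an illustration of the documented behavior, not Pig code: it treats several FLATTENs in one GENERATE as a cross product of the bags, so one empty bag leaves nothing to combine. The helper names `flatten_row` and `null_if_empty` are made up for this sketch, and padding an empty bag with one all-None tuple of the expected width is an assumption standing in for the `IsEmpty(...) ? null : ...` guard.

```python
from itertools import product

def flatten_row(group, *bags):
    """Model of GENERATE group, FLATTEN(bag1), FLATTEN(bag2), ...

    Each bag is a list of tuples. The result is the cross product of the
    bags, so a single empty bag yields no output rows at all.
    """
    return [(group,) + sum(combo, ()) for combo in product(*bags)]

def null_if_empty(bag, width):
    """Stand-in for (IsEmpty(bag) ? null : bag): pad with one all-None tuple."""
    return bag if bag else [(None,) * width]

# The COGROUP output from the thread, written as (group, bag of A, bag of B).
cogrouped = [
    (1, [("a", "b")], [("x",)]),
    (2, [("c", "d")], [("y",)]),
    (3, [("e", "f")], []),
    (6, [], [("z",)]),
    (8, [], [("w",)]),
]

# Plain flatten: ids 3, 6 and 8 vanish because one of their bags is empty.
plain = [row for g, a, b in cogrouped for row in flatten_row(g, a, b)]

# Guarded flatten: empty bags are padded first, so every id survives.
guarded = [
    row
    for g, a, b in cogrouped
    for row in flatten_row(g, null_if_empty(a, 2), null_if_empty(b, 1))
]
```

In this model `plain` contains only the rows for ids 1 and 2, matching the DUMP D output in the question, while `guarded` keeps one row per id with None in the missing columns.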
Re: FLATTEN eats null rows?
Dexin Wang 2010-12-31, 04:02
Thanks. Both worked fine.

I think I'll make a MyFlatten that doesn't drop the empty bag. Say you want to COGROUP 3 or more bags: you would have to do many COGROUPs or JOINs, then apply IsEmpty or bincond every time. Instead, with MyFlatten, I would do:

X = COGROUP A BY id, B BY id, C BY id, D BY id;
Y = FOREACH X GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.(f3,f4,f5)), FLATTEN(C.f6), FLATTEN(D.f7);

The code will be a lot more concise and cleaner.
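
What a MyFlatten along these lines would have to do can be sketched outside Pig. The Python below is a model of the idea only, not a Pig UDF: the function name `my_flatten`, the `widths` argument, and the sample values are all hypothetical, and padding an empty bag with a single all-None tuple is an assumption about the desired output shape.

```python
from itertools import product

def my_flatten(group, bags, widths):
    """Cross-product flatten that keeps rows whose bags are empty.

    bags   -- one list of tuples per cogrouped relation
    widths -- number of fields projected from each relation
    An empty bag is padded with one all-None tuple (an assumption of this
    sketch) so that it no longer wipes out the cross product.
    """
    padded = [bag if bag else [(None,) * w] for bag, w in zip(bags, widths)]
    return [(group,) + sum(combo, ()) for combo in product(*padded)]

# Hypothetical cogroup result for one id across four relations A, B, C, D,
# where B and D have no tuple for this id.
rows = my_flatten(
    7,
    bags=[[("a1", "a2")], [], [("c6",)], []],
    widths=[2, 3, 1, 1],
)
```

The padding happens once inside the helper, which is the saving being asked for: the caller no longer repeats an IsEmpty check for every relation in the COGROUP.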