Pig, mail # user - FLATTEN eats null rows?


FLATTEN eats null rows?
Dexin Wang 2010-12-31, 00:35
It seems that after FLATTEN, rows containing an empty bag get dropped.

I have two test files:

% cat test1.txt
1 a b
2 c d
3 e f

% cat test2.txt
1 x
2 y
6 z
8 w

I'm trying to cogroup the two on the first column:

A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;
DUMP C;

(1,{(1,a,b)},{(1,x)})
(2,{(2,c,d)},{(2,y)})
(3,{(3,e,f)},{})
(6,{},{(6,z)})
(8,{},{(8,w)})

D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;

(1,a,b,x)
(2,c,d,y)

E = FOREACH C GENERATE group, A.(f1, f2), B.f3;
DUMP E;

(1,{(a,b)},{(x)})
(2,{(c,d)},{(y)})
(3,{(e,f)},{})
(6,{},{(z)})
(8,{},{(w)})

As you can see, with FLATTEN (as in D), every row that has an empty bag is
missing. Without FLATTEN (as in E), I get all the rows, but they are not
flattened. What I want as the end result is:

(1,a,b,x)
(2,c,d,y)
(3,e,f,{})
(6,,,z)
(8,,,w)

How can I get that? Thanks.

Dexin

P.S.

I realize I could do a FULL JOIN, but the problem is that after the join I
wouldn't know which id is null. I would need many if/then conditions in the
following GENERATE statement, and I hope to avoid that. E.g.,

C = JOIN A BY id FULL, B BY id;
DUMP C;
(1,a,b,1,x)
(2,c,d,2,y)
(3,e,f,,)
(,,,6,z)
(,,,8,w)

DESCRIBE C;
C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id: bytearray,B::f3: bytearray}

Sometimes A::id is null and sometimes B::id is null; I only ever want the
non-null id in my output.
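
One way to pick the non-null id after the FULL JOIN is a single bincond in the GENERATE, rather than a chain of conditionals. The following is a minimal sketch, assuming the input files are tab-delimited (the default for PigStorage); the aliases J and R are names chosen here for illustration only.

```pig
-- Load both inputs; fields are left untyped (bytearray), as in the script above.
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);

-- A full outer join keeps ids that appear in only one of the inputs.
J = JOIN A BY id FULL OUTER, B BY id;

-- Keep whichever id is non-null; fields from the unmatched side stay null.
R = FOREACH J GENERATE
        (A::id IS NOT NULL ? A::id : B::id) AS id,
        A::f1 AS f1,
        A::f2 AS f2,
        B::f3 AS f3;
DUMP R;
```

With this approach, an id that is missing from test2.txt ends up with a null f3 rather than the empty bag {} shown in the desired output above, so it may or may not fit what is needed downstream.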