Pig >> mail # user >> FLATTEN eats null rows?


FLATTEN eats null rows?
It seems that after FLATTEN, the rows with null values (empty bags) get dropped.

I have two test files:

% cat test1.txt
1 a b
2 c d
3 e f

% cat test2.txt
1 x
2 y
6 z
8 w

I'm trying to cogroup the two on the first column:

A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;
DUMP C;

(1,{(1,a,b)},{(1,x)})
(2,{(2,c,d)},{(2,y)})
(3,{(3,e,f)},{})
(6,{},{(6,z)})
(8,{},{(8,w)})

D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;

(1,a,b,x)
(2,c,d,y)

E = FOREACH C GENERATE group, A.(f1, f2), B.f3;
DUMP E;

(1,{(a,b)},{(x)})
(2,{(c,d)},{(y)})
(3,{(e,f)},{})
(6,{},{(z)})
(8,{},{(w)})

As you can see, when I use FLATTEN, every row with a null value (an empty
bag) is missing from D. Without FLATTEN, as in E, I get all the rows, but
they are not flattened. What I want as the end result is:

(1,a,b,x)
(2,c,d,y)
(3,e,f,{})
(6,,,z)
(8,,,w)

How can I get that? Thanks.
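
A note on why D loses rows: FLATTEN on a bag acts like a cross product with the rest of the record, so an empty bag contributes zero rows. One commonly cited workaround is to swap the empty bag for null with a bincond and the built-in IsEmpty before flattening. The following is a minimal sketch under that assumption, not a verified solution; the alias D2 is hypothetical. A flattened null may come out as a single null field rather than one per projected column, so the column count for ids 6 and 8 may not match the desired output and is worth checking with DESCRIBE and DUMP.

```pig
-- Same inputs and COGROUP as in the post.
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);
C = COGROUP A BY id, B BY id;

-- Replace each empty bag with null before flattening, so that the
-- row is kept instead of being dropped (D2 is a hypothetical alias).
D2 = FOREACH C GENERATE group,
         FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
         FLATTEN((IsEmpty(B) ? null : B.f3));
DUMP D2;
```

The extra parentheses around each bincond are deliberate: Pig expects a bincond expression to be enclosed in its own parentheses.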

Dexin

P.S.

I realize I could do a FULL JOIN, but after the join I wouldn't know which
id is null, so I would need many if-then checks in the following GENERATE
statement, and I hope to avoid that. E.g.,

C = JOIN A BY id FULL, B BY id;
DUMP C;
(1,a,b,1,x)
(2,c,d,2,y)
(3,e,f,,)
(,,,6,z)
(,,,8,w)

DESCRIBE C;
C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id: bytearray,B::f3: bytearray}

Sometimes A::id is null and sometimes B::id is null; I only ever want the
non-null id in my output.
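
For the FULL JOIN route, picking the non-null id takes only one conditional, not one per column. The sketch below uses a single bincond on the two id fields; the aliases J and R are hypothetical, and it assumes the same test1.txt and test2.txt as above.

```pig
A = LOAD 'test1.txt' AS (id, f1, f2);
B = LOAD 'test2.txt' AS (id, f3);

-- Full outer join: unmatched rows get nulls on the missing side.
J = JOIN A BY id FULL OUTER, B BY id;

-- Keep whichever id is present; the other fields pass through as-is
-- and stay null where the corresponding side had no match.
R = FOREACH J GENERATE
        (A::id IS NOT NULL ? A::id : B::id) AS id,
        A::f1 AS f1,
        A::f2 AS f2,
        B::f3 AS f3;
DUMP R;
```

Since the join only ever leaves one of the two ids null for a given row, this should yield one id column followed by f1, f2 and f3, with nulls where a side is missing.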