Pig >> mail # user >> FLATTEN eats null rows?


Re: FLATTEN eats null rows?

On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:

> Seems after FLATTEN, the rows with null values get dropped.
>
What you are seeing is the expected/documented behavior of flatten -
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
"Note that the flatten of empty bag will result in that row being discarded"
(Note that it's 'empty bag', not 'null'.)
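
To illustrate, here is a minimal sketch of the kind of script that shows this. It assumes C was built with COGROUP (the original script was not posted), and the file names and PigStorage delimiter are hypothetical:

```pig
-- Hypothetical inputs: a.txt holds (id,f1,f2) rows, b.txt holds (id,f3) rows.
A = LOAD 'a.txt' USING PigStorage(',') AS (id, f1, f2);
B = LOAD 'b.txt' USING PigStorage(',') AS (id, f3);

-- COGROUP emits one row per id; a side with no match is an empty bag, not null.
C = COGROUP A BY id, B BY id;

-- FLATTEN of an empty bag discards the row, so ids that appear
-- in only one of the two inputs are missing from D.
D = FOREACH C GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.f3);
DUMP D;
```
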
>
> You see if I do FLATTEN, all the rows with null values are all missing (in
> D). If I don't do FLATTEN, as in E, I have all the rows but not flattened,
> obviously. What I want as the end result is:
>
> (1,a,b,x)
> (2,c,d,y)
> (3,e,f,{})
> (6,,,z)
> (8,,,w)
>
> How can I get that? Thanks.
>

To keep those rows, you can replace each empty bag with null before flattening:

 D = FOREACH C GENERATE group, FLATTEN((IsEmpty(A) ? null : A.(f1,f2))),
     FLATTEN((IsEmpty(B) ? null : B.f3));
 

> I realize I could do FULL JOIN, but the problem is that after join, I
> wouldn't know which id is null, I would have to do many if then in the
> following generate command and I hope I can avoid that. E.g.,
>
> C = JOIN A BY id FULL, B BY id;
> DUMP C
> (1,a,b,1,x)
> (2,c,d,2,y)
> (3,e,f,,)
> (,,,6,z)
> (,,,8,w)
>

> DESCRIBE C;
> C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> bytearray,B::f3: bytearray}
>
> Sometimes A::id is null, sometimes B::id null, I always only want the
> non-null id in my output.
>

You can get this with the conditional expression (? :), which the Pig
documentation calls bincond.

E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2, B::f3;
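
Put together, the FULL JOIN approach might look like the sketch below. The file names, the PigStorage delimiter, and the AS aliases are assumptions for illustration; the field names follow the DESCRIBE output quoted above:

```pig
-- Hypothetical inputs: a.txt holds (id,f1,f2) rows, b.txt holds (id,f3) rows.
A = LOAD 'a.txt' USING PigStorage(',') AS (id, f1, f2);
B = LOAD 'b.txt' USING PigStorage(',') AS (id, f3);

-- A full outer join keeps unmatched rows from both sides, with nulls for the missing side.
C = JOIN A BY id FULL OUTER, B BY id;

-- bincond picks whichever id is non-null, so each output row carries a single id column.
E = FOREACH C GENERATE
        (A::id IS NULL ? B::id : A::id) AS id,
        A::f1 AS f1,
        A::f2 AS f2,
        B::f3 AS f3;
DUMP E;
```

Only the id column needs the bincond; the other fields can pass through as they are, since a null there is already what you want.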

-Thejas