Pig >> mail # user >> FLATTEN eats null rows?


Re: FLATTEN eats null rows?
Thanks. Both worked fine.

I think I'll make a MyFlatten that doesn't drop the empty bag. Say you want
to COGROUP 3 or more bags: you would have to do many COGROUPs or JOINs, then
do IsEmpty or bincond every time. Instead, with MyFlatten, I would do:

X = COGROUP A BY id, B BY id, C BY id, D BY id;
Y = FOREACH X GENERATE group, FLATTEN(A.(f1, f2)), FLATTEN(B.(f3,f4,f5)),
FLATTEN(C.f6), FLATTEN(D.f7);

The code would be a lot cleaner and more concise.
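
For comparison, here is a rough sketch of the same four-way case using only
built-in operators, with the IsEmpty/bincond workaround quoted below applied
to every bag. The file names and field layouts are hypothetical and the
script has not been run; it only illustrates the repetition that a MyFlatten
UDF would remove.

```pig
-- Hypothetical inputs; file names and schemas are for illustration only.
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3, f4, f5);
C = LOAD 'c.txt' AS (id, f6);
D = LOAD 'd.txt' AS (id, f7);

X = COGROUP A BY id, B BY id, C BY id, D BY id;

-- Each bag needs its own IsEmpty guard so that an empty bag does not
-- cause FLATTEN to discard the whole row.
Y = FOREACH X GENERATE group,
        FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
        FLATTEN((IsEmpty(B) ? null : B.(f3, f4, f5))),
        FLATTEN((IsEmpty(C) ? null : C.f6)),
        FLATTEN((IsEmpty(D) ? null : D.f7));
```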

On Thu, Dec 30, 2010 at 6:46 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

>
>
>
> On 12/30/10 4:35 PM, "Dexin Wang" <[EMAIL PROTECTED]> wrote:
>
> > Seems after FLATTEN, the rows with null values get dropped.
> >
> What you are seeing is the expected/documented behavior of flatten -
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Flatten+Operator
> "Note that the flatten of empty bag will result in that row being
> discarded"
> (Note that it's 'empty bag', not 'null').
>
>
> >
> > You see if I do FLATTEN, all the rows with null values are all missing
> > (in D). If I don't do FLATTEN, as in E, I have all the rows but not
> > flattened, obviously. What I want as the end result is:
> >
> > (1,a,b,x)
> > (2,c,d,y)
> > (3,e,f,{})
> > (6,,,z)
> > (8,,,w)
> >
> > How can I get that? Thanks.
> >
>
>  D = FOREACH C GENERATE group,  FLATTEN((IsEmpty(A) ? null : A.(f1,f2) )),
> FLATTEN((IsEmpty(B) ? null : B.f3));
>
>
> > I realize I could do FULL JOIN, but the problem is that after the join, I
> > wouldn't know which id is null, so I would have to do many if-then checks
> > in the following GENERATE statement, and I hope I can avoid that. E.g.,
> >
> > C = JOIN A BY id FULL, B BY id;
> > DUMP C;
> > (1,a,b,1,x)
> > (2,c,d,2,y)
> > (3,e,f,,)
> > (,,,6,z)
> > (,,,8,w)
> >
>
>
>
> > DESCRIBE C;
> > C: {A::id: bytearray,A::f1: bytearray,A::f2: bytearray,B::id:
> > bytearray,B::f3: bytearray}
> >
> > Sometimes A::id is null, sometimes B::id null, I always only want the
> > non-null id in my output.
> >
>
> You can get this by using the conditional expression (called bincond in the
> Pig documentation) (? :).
>
> E = foreach C generate (A::id is null ? B::id : A::id), A::f1, A::f2,
> B::f3;
>
> -Thejas
>
>
>
>
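
Putting the two suggestions from this thread together, a complete script
might look like the sketch below. The input file names are assumptions, and
the sample rows listed in the comments are inferred from the FULL JOIN output
quoted above. The script has not been run here, so no output is shown.

```pig
-- Assumed tab-separated inputs (file names are hypothetical).
-- Rows inferred from the DUMP of the FULL JOIN quoted in the thread:
--   a.txt: (1,a,b) (2,c,d) (3,e,f)
--   b.txt: (1,x) (2,y) (6,z) (8,w)
A = LOAD 'a.txt' AS (id, f1, f2);
B = LOAD 'b.txt' AS (id, f3);

-- Approach 1: COGROUP, then guard each FLATTEN against an empty bag.
G = COGROUP A BY id, B BY id;
D = FOREACH G GENERATE group,
        FLATTEN((IsEmpty(A) ? null : A.(f1, f2))),
        FLATTEN((IsEmpty(B) ? null : B.f3));

-- Approach 2: FULL OUTER JOIN, then keep whichever id is non-null.
J = JOIN A BY id FULL, B BY id;
E = FOREACH J GENERATE (A::id is null ? B::id : A::id) AS id,
        A::f1, A::f2, B::f3;

DUMP D;
DUMP E;
```

Approach 1 keeps a single group key per row, so no id has to be chosen
afterwards. Approach 2 avoids bags altogether but needs the bincond to merge
the two id columns.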