Pig >> mail # user >> COUNT() thinks non-null tuples are null if the first field is null?


Re: COUNT() thinks non-null tuples are null if the first field is null?
Sorry, correcting an imprecision here--field1 is the first field of the
records that have been grouped; that it's the first field in the key is
nonessential. So basically *any* group/count that I have done in the past
could have been dropping records because the first field happened to be
something I didn't care about at the time that could be null. I am
distressed by this realization.

Thanks again,

Adair

On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:

> Hi, guys, was wondering what's going on with this.
>
> In Pig 0.9, if I do something like this:
>
> grouped = group data by (field1, field2);
> count = foreach grouped generate COUNT(data);
>
> That count is 0 wherever field1 is null regardless of what comes after.
>
> I can use COUNT_STAR() instead (data fresh from a group won't have any
> null records, right?), but it seems like that should be the expected
> behavior of COUNT().
>
> This was obviously intended behavior, since it's right there in the
> function:
>
> if (t != null && t.size() > 0 && t.get(0) != null)
>     cnt++;
>
> but it just seems bizarre and inconvenient to me. Nor is it mentioned in
> the documentation, unless the bit written for people who are good at SQL
> implies it. Now I'm wondering which of my past scripts might be buggy
> because I didn't expect this behavior.
>
> Anyone have an explanation?
>
> Thanks,
>
> Adair
>
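
To make the difference concrete, here is a minimal sketch in plain Java that imitates the check quoted above, without depending on Pig itself. The class name CountSemantics, the methods count and countStar, and the use of List<Object> as a stand-in for a Pig tuple are all assumptions made for illustration; this is not Pig's actual implementation, and countStar is only an approximation of what COUNT_STAR() is documented to do (count every tuple, nulls included).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the behavior discussed in this thread; not Pig code.
class CountSemantics {

    // Imitates the quoted check: a tuple is counted only if it is non-null,
    // non-empty, and its FIRST field is non-null.
    static long count(List<List<Object>> bag) {
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Assumed COUNT_STAR-like behavior: every tuple in the bag is counted.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.asList(null, "a"));   // first field null, second not
        bag.add(Arrays.asList("x", null));   // first field present
        bag.add(Arrays.asList(null, null));  // both fields null

        System.out.println("count     = " + count(bag));      // prints "count     = 1"
        System.out.println("countStar = " + countStar(bag));  // prints "countStar = 3"
    }
}
```

In this sketch the tuple (null, "a") is skipped by count even though it is not itself null, which is the surprise described in the thread; countStar includes it.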