Pig >> mail # user >> COUNT() thinks non-null tuples are null if the first field is null?


Re: COUNT() thinks non-null tuples are null if the first field is null?
Sorry, correcting an imprecision here--field1 is the first field of the
records that have been grouped; that it's the first field in the key is
nonessential. So basically *any* group/count that I have done in the past
could have been dropping records because the first field happened to be
something I didn't care about at the time that could be null. I am
distressed by this realization.

Thanks again,

Adair

On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:

> Hi, guys, was wondering what's going on with this.
>
> In Pig 0.9, if I do something like this:
>
> grouped = group data by (field1, field2);
> count = foreach grouped generate COUNT(data);
>
> That count is 0 wherever field1 is null regardless of what comes after.
>
> I can use COUNT_STAR() instead (data fresh from a group won't have any
> null records, right?), but it seems like that should be the expected
> behavior of COUNT().
>
> This was obviously intended behavior, since it's right there in the
> function:
>
> if (t != null && t.size() > 0 && t.get(0) != null)
>     cnt++;
>
> but it just seems bizarre and inconvenient to me. Nor is it mentioned in
> the documentation, unless the bit written for people who are good at SQL
> implies it. Now I'm wondering which of my past scripts might be buggy
> because I didn't expect this behavior.
>
> Anyone have an explanation?
>
> Thanks,
>
> Adair
>
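
To make the behavior described in this thread concrete, here is a minimal Java sketch of the quoted condition. It is not Pig's actual code: tuples are modeled as plain List<Object> values and a bag as a list of tuples, and the names CountSketch, count, countStar, and sampleBag are hypothetical. The countStar method assumes COUNT_STAR simply counts every tuple in the bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in model: each tuple is a List<Object>, a bag is a list of tuples.
// This is NOT Pig's Tuple/DataBag API; it only mirrors the quoted condition.
class CountSketch {

    // Mirrors the check quoted from COUNT: a tuple is counted only if it is
    // non-null, non-empty, and its FIRST field is non-null.
    static long count(List<List<Object>> bag) {
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Assumed COUNT_STAR-like behavior: every tuple in the bag is counted.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    // A grouped bag of three records with fields (field1, field2).
    static List<List<Object>> sampleBag() {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList(null, "a")); // field1 null, field2 set
        bag.add(Arrays.<Object>asList(null, "b")); // field1 null, field2 set
        bag.add(Arrays.<Object>asList("x", null)); // field1 set, field2 null
        return bag;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = sampleBag();
        System.out.println("count     = " + count(bag));
        System.out.println("countStar = " + countStar(bag));
    }
}
```

In this model, only the record whose first field is set passes the quoted check, while the other two are skipped even though they are non-null tuples with data in field2. That matches the symptom reported above: the result depends on the first field of the grouped records, not on whether the record itself is null.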