COUNT() thinks non-null tuples are null if the first field is null?
Hi, guys, was wondering what's going on with this.

In Pig 0.9, if I do something like this:

grouped = group data by (field1, field2);
count = foreach grouped generate COUNT(data);

That count is 0 wherever field1 is null, regardless of what the remaining fields contain.

I can use COUNT_STAR() instead (data fresh from a group won't have any null
records, right?), but it seems like that should be the expected behavior of
COUNT().

This was obviously intended behavior, since it's right there in the
function:

if (t != null && t.size() > 0 && t.get(0) != null)
    cnt++;

but it just seems bizarre and inconvenient to me. Nor is it mentioned in
the documentation, unless the bit written for people who are good at SQL
implies it. Now I'm wondering which of my past scripts might be buggy
because I didn't expect this behavior.
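
To make the difference concrete, here is a minimal standalone Java sketch. It is not Pig's actual code: the class name CountSketch and the use of plain lists in place of Pig's tuple and bag types are assumptions for illustration only. The count() method reuses the null check quoted above, and countStar() models COUNT_STAR() as simply counting every tuple in the bag.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Pig bag: each inner list plays the role of a tuple.
class CountSketch {

    // Same condition as the check quoted above:
    // a tuple is counted only if its first field is non-null.
    static long count(List<List<Object>> bag) {
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Models COUNT_STAR(): every tuple in the bag is counted.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        // Three tuples of (field1, field2); field1 is null in two of them.
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList(null, "a"),
            Arrays.<Object>asList(null, "b"),
            Arrays.<Object>asList("x", "c"));

        System.out.println("count      = " + count(bag));     // 1
        System.out.println("count_star = " + countStar(bag)); // 3
    }
}
```

With a group where field1 is null in every tuple, count() in this sketch returns 0 while countStar() returns the number of tuples, which matches what I am seeing.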

Anyone have an explanation?

Thanks,

Adair