Pig, mail # user - COUNT() thinks non-null tuples are null if the first field is null?


Re: COUNT() thinks non-null tuples are null if the first field is null?
Bill Graham 2013-02-06, 04:33
This behavior is discussed in the count docs:

http://pig.apache.org/docs/r0.10.0/func.html#count

The COUNT function follows syntax semantics and ignores nulls. What this
means is that a tuple in the bag will not be counted if the FIRST FIELD in
this tuple is NULL. If you want to include NULL values in the count
computation, use COUNT_STAR
<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR>.

There is a proposal to change this though, which provides more context:
https://issues.apache.org/jira/browse/PIG-1014
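
To make the documented behavior concrete, here is a minimal Java sketch of the first-field check quoted further down in this thread. It is not Pig's source: the class name CountSketch, the helper sampleBag(), and the use of plain java.util lists in place of Pig's tuple and bag types are assumptions made for illustration. The countStar method only models the documented effect of COUNT_STAR (null values are included), not its real implementation.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model: a "tuple" is a List<Object>, a "bag" is a list of tuples.
// These are stand-ins for illustration, not Pig's own classes.
class CountSketch {

    // Mirrors the check quoted in this thread: a tuple is counted only
    // if it is non-null, non-empty, and its FIRST field is non-null.
    static long count(List<List<Object>> bag) {
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Assumed COUNT_STAR-like behavior: every tuple in the bag is counted,
    // whatever its fields contain.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    // Hypothetical bag: two tuples with a null first field, one without.
    static List<List<Object>> sampleBag() {
        return Arrays.asList(
            Arrays.<Object>asList(null, "a"),
            Arrays.<Object>asList(null, "b"),
            Arrays.<Object>asList("x", null));
    }

    public static void main(String[] args) {
        List<List<Object>> bag = sampleBag();
        System.out.println("COUNT-like:      " + count(bag));
        System.out.println("COUNT_STAR-like: " + countStar(bag));
    }
}
```

With the sample bag, the COUNT-like method reports 1 and the COUNT_STAR-like method reports 3: the two tuples whose first field is null are skipped by the first and kept by the second, even though their second field is set.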

On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:

> Sorry, correcting an imprecision here--field1 is the first field of the
> records that have been grouped; that it's the first field in the key is
> nonessential. So basically *any* group/count that I have done in the past
> could have been dropping records because the first field happened to be
> something I didn't care about at the time that could be null. I am
> distressed by this realization.
>
> Thanks again,
>
> Adair
>
> On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:
>
> > Hi, guys, was wondering what's going on with this.
> >
> > In Pig 0.9, if I do something like this:
> >
> > grouped = group data by (field1, field2);
> > count = foreach grouped generate COUNT(data);
> >
> > That count is 0 wherever field1 is null regardless of what comes after.
> >
> > I can use COUNT_STAR() instead (data fresh from a group won't have any
> > null records, right?), but it seems like that should be the expected
> > behavior of COUNT().
> >
> > This was obviously intended behavior, since it's right there in the
> > function:
> >
> > if (t != null && t.size() > 0 && t.get(0) != null)
> >     cnt++;
> >
> > but it just seems bizarre and inconvenient to me. Nor is it mentioned in
> > the documentation, unless the bit written for people who are good at SQL
> > implies it. Now I'm wondering which of my past scripts might be buggy
> > because I didn't expect this behavior.
> >
> > Anyone have an explanation?
> >
> > Thanks,
> >
> > Adair
> >
>
