Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> COUNT() thinks non-null tuples are null if the first field is null?


Copy link to this message
-
Re: COUNT() thinks non-null tuples are null if the first field is null?
Old db trick. If you want complex count, use sum(
complex_boolean_expression ? 1 : 0 )
I think, it should work for pig
2013/2/6 Adair Kovac <[EMAIL PROTECTED]>

> Thanks Bill! I didn't see that documentation--only the javadoc, which for
> Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't
> be trusting a Google search for pig-related documentation because the
> search engine seems to get confused by all the versions.
>
> I also eventually realized what this (from the javadoc) means: "Generates
> the count of the number of values in a bag. This count does not include
> null values, and thus matches SQL semantics for COUNT(a) (where a is field)
> but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to
> run over a single field. I think it's bad for the documentation to conflate
> the two meanings of "null value". For what it's worth, in regards to the
> jira proposal, I think if it retains its current behavior, COUNT should
> error if the tuple size is greater than 1.
>
> Okay, next question:
>
> Say I'm doing counts on votes cast for student government at a large
> university. Every position (president, vice president, down to dozens of
> senators) is optional: a student can vote for them or not vote for them.
> Many students don't vote, but the data I have reads that as a tuple of
> nulls. I want to group on president and vice president and count all the
> records where not every field is null (i.e., I want my counts to include
> one for the number of students who voted for some office, but neither
> president or vice president). I can't use COUNT_STAR, because it would
> count the non-voters (I think). I can't use COUNT, because all students who
> didn't vote for whatever column I passed in (or whatever column was first)
> would get dropped. Filtering on $0 is not null or $1 is not null etc would
> get really cumbersome, since I have dozens of fields. Is there a better
> solution?
>
> Thanks!
>
> Adair
>
> On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
> > This behavior is discussed in the count docs:
> >
> > http://pig.apache.org/docs/r0.10.0/func.html#count
> >
> > The COUNT function follows syntax semantics and ignores nulls. What this
> > means is that a tuple in the bag will not be counted if the FIRST FIELD
> in
> > this tuple is NULL. If you want to include NULL values in the count
> > computation, use
> > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR>
> > .
> >
> > There is a proposal to change this though, which provides more context:
> > https://issues.apache.org/jira/browse/PIG-1014
> >
> > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]>
> wrote:
> >
> > > Sorry, correcting an imprecision here--field1 is the first field of the
> > > records that have been grouped; that it's the first field in the key is
> > > nonessential. So basically *any* group/count that I have done in the
> past
> > > could have been dropping records because the first field happened to be
> > > something I didn't care about at the time that could be null. I am
> > > distressed by this realization.
> > >
> > > Thanks again,
> > >
> > > Adair
> > >
> > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > Hi, guys, was wondering what's going on with this.
> > > >
> > > > In pig 0.9 if I do something like this:
> > > >
> > > > grouped = group data by (field1, field2);
> > > > count = foreach grouped generate COUNT(data);
> > > >
> > > > That count is 0 wherever field1 is null regardless of what comes
> after.
> > > >
> > > > I can use COUNT_STAR() instead (data fresh from a group won't have
> any
> > > > null records, right?), but it seems like that should be the expected
> > > > behavior of COUNT().
> > > >
> > > > This was obviously intended behavior, since it's right there in the
> > > > function:
> > > >
> > > > if (t != null && t.size() > 0 && t.get(0) != null )
> > > >                             cnt++;

Best regards,
 Vitalii Tymchyshyn