Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> COUNT() thinks non-null tuples are null if the first field is null?


Copy link to this message
-
Re: COUNT() thinks non-null tuples are null if the first field is null?
CUBE?
On Feb 6, 2013 8:33 AM, "Adair Kovac" <[EMAIL PROTECTED]> wrote:

> Thanks Bill! I didn't see that documentation--only the javadoc, which for
> Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't
> be trusting a Google search for pig-related documentation because the
> search engine seems to get confused by all the versions.
>
> I also eventually realized what this (from the javadoc) means: "Generates
> the count of the number of values in a bag. This count does not include
> null values, and thus matches SQL semantics for COUNT(a) (where a is field)
> but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to
> run over a single field. I think it's bad for the documentation to conflate
> the two meanings of "null value". For what it's worth, in regards to the
> jira proposal, I think if it retains its current behavior, COUNT should
> error if the tuple size is greater than 1.
>
> Okay, next question:
>
> Say I'm doing counts on votes cast for student government at a large
> university. Every position (president, vice president, down to dozens of
> senators) is optional: a student can vote for them or not vote for them.
> Many students don't vote, but the data I have reads that as a tuple of
> nulls. I want to group on president and vice president and count all the
> records where not every field is null (i.e., I want my counts to include
> one for the number of students who voted for some office, but neither
> president or vice president). I can't use COUNT_STAR, because it would
> count the non-voters (I think). I can't use COUNT, because all students who
> didn't vote for whatever column I passed in (or whatever column was first)
> would get dropped. Filtering on $0 is not null or $1 is not null etc would
> get really cumbersome, since I have dozens of fields. Is there a better
> solution?
>
> Thanks!
>
> Adair
>
> On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
> > This behavior is discussed in the count docs:
> >
> > http://pig.apache.org/docs/r0.10.0/func.html#count
> >
> > The COUNT function follows syntax semantics and ignores nulls. What this
> > means is that a tuple in the bag will not be counted if the FIRST FIELD
> in
> > this tuple is NULL. If you want to include NULL values in the count
> > computation, use
> > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR>
> > .
> >
> > There is a proposal to change this though, which provides more context:
> > https://issues.apache.org/jira/browse/PIG-1014
> >
> > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]>
> wrote:
> >
> > > Sorry, correcting an imprecision here--field1 is the first field of the
> > > records that have been grouped; that it's the first field in the key is
> > > nonessential. So basically *any* group/count that I have done in the
> past
> > > could have been dropping records because the first field happened to be
> > > something I didn't care about at the time that could be null. I am
> > > distressed by this realization.
> > >
> > > Thanks again,
> > >
> > > Adair
> > >
> > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > Hi, guys, was wondering what's going on with this.
> > > >
> > > > In pig 0.9 if I do something like this:
> > > >
> > > > grouped = group data by (field1, field2);
> > > > count = foreach grouped generate COUNT(data);
> > > >
> > > > That count is 0 wherever field1 is null regardless of what comes
> after.
> > > >
> > > > I can use COUNT_STAR() instead (data fresh from a group won't have
> any
> > > > null records, right?), but it seems like that should be the expected
> > > > behavior of COUNT().
> > > >
> > > > This was obviously intended behavior, since it's right there in the
> > > > function:
> > > >
> > > > if (t != null && t.size() > 0 && t.get(0) != null )
> > > >                             cnt++;
> > > >
> > > > but it just seems bizarre and inconvenient to me. Nor is it mentioned
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB