Pig >> mail # user >> COUNT() thinks non-null tuples are null if the first field is null?


Re: COUNT() thinks non-null tuples are null if the first field is null?
This behavior is described in the COUNT docs:

http://pig.apache.org/docs/r0.10.0/func.html#count

The COUNT function follows syntax semantics and ignores nulls. What this
means is that a tuple in the bag will not be counted if the FIRST FIELD in
this tuple is NULL. If you want to include NULL values in the count
computation, use COUNT_STAR
<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR>.

There is a proposal to change this though, which provides more context:
https://issues.apache.org/jira/browse/PIG-1014
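
To make the difference concrete, here is a minimal Java sketch of the two
counting rules. It is only an illustration under stated assumptions: the
tuple and bag types are plain Java lists rather than Pig's own classes, the
names CountSketch, count and countStar are made up for this example, and
countStar simply assumes that COUNT_STAR counts every tuple in the bag. The
condition inside count is the one quoted from the COUNT source further down
this thread.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a "tuple" is a List<Object> and a "bag" is a
// List of tuples. These are NOT Pig's real Tuple/DataBag classes.
final class CountSketch {

    // Mirrors the check quoted from COUNT: skip null tuples, empty tuples,
    // and tuples whose first field is null.
    static long count(List<List<Object>> bag) {
        long cnt = 0;
        for (List<Object> t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    // Assumed COUNT_STAR-like rule: every tuple in the bag is counted,
    // whatever its fields contain.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        // Three grouped records; the first field is null in two of them.
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList(null, "a"),
            Arrays.<Object>asList(null, "b"),
            Arrays.<Object>asList("x", "c"));

        System.out.println("COUNT-like:      " + count(bag));     // prints 1
        System.out.println("COUNT_STAR-like: " + countStar(bag)); // prints 3
    }
}
```

With this input the COUNT-like rule reports 1 and the COUNT_STAR-like rule
reports 3, which matches the dropped records described in the original
question.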

On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:

> Sorry, correcting an imprecision here--field1 is the first field of the
> records that have been grouped; that it's the first field in the key is
> nonessential. So basically *any* group/count that I have done in the past
> could have been dropping records because the first field happened to be
> something I didn't care about at the time that could be null. I am
> distressed by this realization.
>
> Thanks again,
>
> Adair
>
> On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote:
>
> > Hi, guys, was wondering what's going on with this.
> >
> > In Pig 0.9, if I do something like this:
> >
> > grouped = group data by (field1, field2);
> > count = foreach grouped generate COUNT(data);
> >
> > That count is 0 wherever field1 is null regardless of what comes after.
> >
> > I can use COUNT_STAR() instead (data fresh from a group won't have any
> > null records, right?), but it seems like that should be the expected
> > behavior of COUNT().
> >
> > This was obviously intended behavior, since it's right there in the
> > function:
> >
> > if (t != null && t.size() > 0 && t.get(0) != null)
> >     cnt++;
> >
> > but it just seems bizarre and inconvenient to me. Nor is it mentioned in
> > the documentation, unless the bit written for people who are good at SQL
> > implies it. Now I'm wondering which of my past scripts might be buggy
> > because I didn't expect this behavior.
> >
> > Anyone have an explanation?
> >
> > Thanks,
> >
> > Adair
> >
>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*