|
Adair Kovac
2013-02-05, 22:14
Adair Kovac
2013-02-05, 22:30
Bill Graham
2013-02-06, 04:33
Adair Kovac
2013-02-06, 16:32
Russell Jurney
2013-02-06, 20:35
Vitalii Tymchyshyn
2013-02-07, 21:12
|
-
COUNT() thinks non-null tuples are null if the first field is null?Adair Kovac 2013-02-05, 22:14
Hi, guys, was wondering what's going on with this.
In pig 0.9 if I do something like this: grouped = group data by (field1, field2); count = foreach grouped generate COUNT(data); That count is 0 wherever field1 is null regardless of what comes after. I can use COUNT_STAR() instead (data fresh from a group won't have any null records, right?), but it seems like that should be the expected behavior of COUNT(). This was obviously intended behavior, since it's right there in the function: if (t != null && t.size() > 0 && t.get(0) != null ) cnt++; but it just seems bizarre and inconvenient to me. Nor is it mentioned in the documentation, unless the bit written for people who are good at SQL implies it. Now I'm wondering which of my past scripts might be buggy because I didn't expect this behavior. Anyone have an explanation? Thanks, Adair
-
Re: COUNT() thinks non-null tuples are null if the first field is null?Adair Kovac 2013-02-05, 22:30
Sorry, correcting an imprecision here--field1 is the first field of the
records that have been grouped; that it's the first field in the key is nonessential. So basically *any* group/count that I have done in the past could have been dropping records because the first field happened to be something I didn't care about at the time that could be null. I am distressed by this realization. Thanks again, Adair On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote: > Hi, guys, was wondering what's going on with this. > > In pig 0.9 if I do something like this: > > grouped = group data by (field1, field2); > count = foreach grouped generate COUNT(data); > > That count is 0 wherever field1 is null regardless of what comes after. > > I can use COUNT_STAR() instead (data fresh from a group won't have any > null records, right?), but it seems like that should be the expected > behavior of COUNT(). > > This was obviously intended behavior, since it's right there in the > function: > > if (t != null && t.size() > 0 && t.get(0) != null ) > cnt++; > > but it just seems bizarre and inconvenient to me. Nor is it mentioned in > the documentation, unless the bit written for people who are good at SQL > implies it. Now I'm wondering which of my past scripts might be buggy > because I didn't expect this behavior. > > Anyone have an explanation? > > Thanks, > > Adair >
-
Re: COUNT() thinks non-null tuples are null if the first field is null?Bill Graham 2013-02-06, 04:33
This behavior is discussed in the count docs:
http://pig.apache.org/docs/r0.10.0/func.html#count The COUNT function follows syntax semantics and ignores nulls. What this means is that a tuple in the bag will not be counted if the FIRST FIELD in this tuple is NULL. If you want to include NULL values in the count computation, use COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR> . There is a proposal to change this though, which provides more context: https://issues.apache.org/jira/browse/PIG-1014 On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> wrote: > Sorry, correcting an imprecision here--field1 is the first field of the > records that have been grouped; that it's the first field in the key is > nonessential. So basically *any* group/count that I have done in the past > could have been dropping records because the first field happened to be > something I didn't care about at the time that could be null. I am > distressed by this realization. > > Thanks again, > > Adair > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> wrote: > > > Hi, guys, was wondering what's going on with this. > > > > In pig 0.9 if I do something like this: > > > > grouped = group data by (field1, field2); > > count = foreach grouped generate COUNT(data); > > > > That count is 0 wherever field1 is null regardless of what comes after. > > > > I can use COUNT_STAR() instead (data fresh from a group won't have any > > null records, right?), but it seems like that should be the expected > > behavior of COUNT(). > > > > This was obviously intended behavior, since it's right there in the > > function: > > > > if (t != null && t.size() > 0 && t.get(0) != null ) > > cnt++; > > > > but it just seems bizarre and inconvenient to me. Nor is it mentioned in > > the documentation, unless the bit written for people who are good at SQL > > implies it. Now I'm wondering which of my past scripts might be buggy > > because I didn't expect this behavior. > > > > Anyone have an explanation? > > > > Thanks, > > > > Adair > > > -- *Note that I'm no longer using my Yahoo! email address. Please email me at [EMAIL PROTECTED] going forward.*
-
Re: COUNT() thinks non-null tuples are null if the first field is null?Adair Kovac 2013-02-06, 16:32
Thanks Bill! I didn't see that documentation--only the javadoc, which for
Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't be trusting a Google search for pig-related documentation because the search engine seems to get confused by all the versions. I also eventually realized what this (from the javadoc) means: "Generates the count of the number of values in a bag. This count does not include null values, and thus matches SQL semantics for COUNT(a) (where a is field) but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to run over a single field. I think it's bad for the documentation to conflate the two meanings of "null value". For what it's worth, in regards to the jira proposal, I think if it retains its current behavior, COUNT should error if the tuple size is greater than 1. Okay, next question: Say I'm doing counts on votes cast for student government at a large university. Every position (president, vice president, down to dozens of senators) is optional: a student can vote for them or not vote for them. Many students don't vote, but the data I have reads that as a tuple of nulls. I want to group on president and vice president and count all the records where not every field is null (i.e., I want my counts to include one for the number of students who voted for some office, but neither president or vice president). I can't use COUNT_STAR, because it would count the non-voters (I think). I can't use COUNT, because all students who didn't vote for whatever column I passed in (or whatever column was first) would get dropped. Filtering on $0 is not null or $1 is not null etc would get really cumbersome, since I have dozens of fields. Is there a better solution? Thanks! Adair On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[EMAIL PROTECTED]> wrote: > This behavior is discussed in the count docs: > > http://pig.apache.org/docs/r0.10.0/func.html#count > > The COUNT function follows syntax semantics and ignores nulls. What this > means is that a tuple in the bag will not be counted if the FIRST FIELD in > this tuple is NULL. If you want to include NULL values in the count > computation, use > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR> > . > > There is a proposal to change this though, which provides more context: > https://issues.apache.org/jira/browse/PIG-1014 > > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> wrote: > > > Sorry, correcting an imprecision here--field1 is the first field of the > > records that have been grouped; that it's the first field in the key is > > nonessential. So basically *any* group/count that I have done in the past > > could have been dropping records because the first field happened to be > > something I didn't care about at the time that could be null. I am > > distressed by this realization. > > > > Thanks again, > > > > Adair > > > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> > wrote: > > > > > Hi, guys, was wondering what's going on with this. > > > > > > In pig 0.9 if I do something like this: > > > > > > grouped = group data by (field1, field2); > > > count = foreach grouped generate COUNT(data); > > > > > > That count is 0 wherever field1 is null regardless of what comes after. > > > > > > I can use COUNT_STAR() instead (data fresh from a group won't have any > > > null records, right?), but it seems like that should be the expected > > > behavior of COUNT(). > > > > > > This was obviously intended behavior, since it's right there in the > > > function: > > > > > > if (t != null && t.size() > 0 && t.get(0) != null ) > > > cnt++; > > > > > > but it just seems bizarre and inconvenient to me. Nor is it mentioned > in > > > the documentation, unless the bit written for people who are good at > SQL > > > implies it. Now I'm wondering which of my past scripts might be buggy > > > because I didn't expect this behavior. > > > > > > Anyone have an explanation?
-
Re: COUNT() thinks non-null tuples are null if the first field is null?Russell Jurney 2013-02-06, 20:35
CUBE?
On Feb 6, 2013 8:33 AM, "Adair Kovac" <[EMAIL PROTECTED]> wrote: > Thanks Bill! I didn't see that documentation--only the javadoc, which for > Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't > be trusting a Google search for pig-related documentation because the > search engine seems to get confused by all the versions. > > I also eventually realized what this (from the javadoc) means: "Generates > the count of the number of values in a bag. This count does not include > null values, and thus matches SQL semantics for COUNT(a) (where a is field) > but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to > run over a single field. I think it's bad for the documentation to conflate > the two meanings of "null value". For what it's worth, in regards to the > jira proposal, I think if it retains its current behavior, COUNT should > error if the tuple size is greater than 1. > > Okay, next question: > > Say I'm doing counts on votes cast for student government at a large > university. Every position (president, vice president, down to dozens of > senators) is optional: a student can vote for them or not vote for them. > Many students don't vote, but the data I have reads that as a tuple of > nulls. I want to group on president and vice president and count all the > records where not every field is null (i.e., I want my counts to include > one for the number of students who voted for some office, but neither > president or vice president). I can't use COUNT_STAR, because it would > count the non-voters (I think). I can't use COUNT, because all students who > didn't vote for whatever column I passed in (or whatever column was first) > would get dropped. Filtering on $0 is not null or $1 is not null etc would > get really cumbersome, since I have dozens of fields. Is there a better > solution? > > Thanks! > > Adair > > On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[EMAIL PROTECTED]> wrote: > > > This behavior is discussed in the count docs: > > > > http://pig.apache.org/docs/r0.10.0/func.html#count > > > > The COUNT function follows syntax semantics and ignores nulls. What this > > means is that a tuple in the bag will not be counted if the FIRST FIELD > in > > this tuple is NULL. If you want to include NULL values in the count > > computation, use > > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR> > > . > > > > There is a proposal to change this though, which provides more context: > > https://issues.apache.org/jira/browse/PIG-1014 > > > > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> > wrote: > > > > > Sorry, correcting an imprecision here--field1 is the first field of the > > > records that have been grouped; that it's the first field in the key is > > > nonessential. So basically *any* group/count that I have done in the > past > > > could have been dropping records because the first field happened to be > > > something I didn't care about at the time that could be null. I am > > > distressed by this realization. > > > > > > Thanks again, > > > > > > Adair > > > > > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> > > wrote: > > > > > > > Hi, guys, was wondering what's going on with this. > > > > > > > > In pig 0.9 if I do something like this: > > > > > > > > grouped = group data by (field1, field2); > > > > count = foreach grouped generate COUNT(data); > > > > > > > > That count is 0 wherever field1 is null regardless of what comes > after. > > > > > > > > I can use COUNT_STAR() instead (data fresh from a group won't have > any > > > > null records, right?), but it seems like that should be the expected > > > > behavior of COUNT(). > > > > > > > > This was obviously intended behavior, since it's right there in the > > > > function: > > > > > > > > if (t != null && t.size() > 0 && t.get(0) != null ) > > > > cnt++; > > > > > > > > but it just seems bizarre and inconvenient to me. Nor is it mentioned
-
Re: COUNT() thinks non-null tuples are null if the first field is null?Vitalii Tymchyshyn 2013-02-07, 21:12
Old db trick. If you want complex count, use sum(
complex_boolean_expression ? 1 : 0 ) I think, it should work for pig 2013/2/6 Adair Kovac <[EMAIL PROTECTED]> > Thanks Bill! I didn't see that documentation--only the javadoc, which for > Pig 9.1 definitely doesn't explain that. I realize now I really shouldn't > be trusting a Google search for pig-related documentation because the > search engine seems to get confused by all the versions. > > I also eventually realized what this (from the javadoc) means: "Generates > the count of the number of values in a bag. This count does not include > null values, and thus matches SQL semantics for COUNT(a) (where a is field) > but not for COUNT(*) (where * in SQL indicates all)."--COUNT is supposed to > run over a single field. I think it's bad for the documentation to conflate > the two meanings of "null value". For what it's worth, in regards to the > jira proposal, I think if it retains its current behavior, COUNT should > error if the tuple size is greater than 1. > > Okay, next question: > > Say I'm doing counts on votes cast for student government at a large > university. Every position (president, vice president, down to dozens of > senators) is optional: a student can vote for them or not vote for them. > Many students don't vote, but the data I have reads that as a tuple of > nulls. I want to group on president and vice president and count all the > records where not every field is null (i.e., I want my counts to include > one for the number of students who voted for some office, but neither > president or vice president). I can't use COUNT_STAR, because it would > count the non-voters (I think). I can't use COUNT, because all students who > didn't vote for whatever column I passed in (or whatever column was first) > would get dropped. Filtering on $0 is not null or $1 is not null etc would > get really cumbersome, since I have dozens of fields. Is there a better > solution? > > Thanks! > > Adair > > On Tue, Feb 5, 2013 at 9:33 PM, Bill Graham <[EMAIL PROTECTED]> wrote: > > > This behavior is discussed in the count docs: > > > > http://pig.apache.org/docs/r0.10.0/func.html#count > > > > The COUNT function follows syntax semantics and ignores nulls. What this > > means is that a tuple in the bag will not be counted if the FIRST FIELD > in > > this tuple is NULL. If you want to include NULL values in the count > > computation, use > > COUNT_STAR<http://pig.apache.org/docs/r0.10.0/func.html#COUNT-STAR> > > . > > > > There is a proposal to change this though, which provides more context: > > https://issues.apache.org/jira/browse/PIG-1014 > > > > On Tue, Feb 5, 2013 at 2:30 PM, Adair Kovac <[EMAIL PROTECTED]> > wrote: > > > > > Sorry, correcting an imprecision here--field1 is the first field of the > > > records that have been grouped; that it's the first field in the key is > > > nonessential. So basically *any* group/count that I have done in the > past > > > could have been dropping records because the first field happened to be > > > something I didn't care about at the time that could be null. I am > > > distressed by this realization. > > > > > > Thanks again, > > > > > > Adair > > > > > > On Tue, Feb 5, 2013 at 3:14 PM, Adair Kovac <[EMAIL PROTECTED]> > > wrote: > > > > > > > Hi, guys, was wondering what's going on with this. > > > > > > > > In pig 0.9 if I do something like this: > > > > > > > > grouped = group data by (field1, field2); > > > > count = foreach grouped generate COUNT(data); > > > > > > > > That count is 0 wherever field1 is null regardless of what comes > after. > > > > > > > > I can use COUNT_STAR() instead (data fresh from a group won't have > any > > > > null records, right?), but it seems like that should be the expected > > > > behavior of COUNT(). > > > > > > > > This was obviously intended behavior, since it's right there in the > > > > function: > > > > > > > > if (t != null && t.size() > 0 && t.get(0) != null ) > > > > cnt++; Best regards, Vitalii Tymchyshyn |