-
Possible bug in NULL fields handling
Vincent BARAT 2009-10-15, 12:51
Hello,
I'm not sure if it's a bug, but the handling of NULL fields does not seem to work correctly:
My data (events):
0,,jawi
,0,juug
,,lfou
0,0,caro
My script:
events = load 'events' using PigStorage(',') AS (sessionid:chararray, jobid:chararray, user:chararray);
user_events = group events by user;
dump user_events;
event_count_by_user = foreach user_events generate group, COUNT(events);
dump event_count_by_user;
The results:
user_events (correct):
(caro,{(0,0,caro)})
(jawi,{(0,,jawi)})
(juug,{(,0,juug)})
(lfou,{(,,lfou)})

event_count_by_user (incorrect):
(caro,1L)
(jawi,1L)
(juug,0L)
(lfou,0L)
event_count_by_user should be:
(caro,1L)
(jawi,1L)
(juug,1L)
(lfou,1L)
It seems that tuples whose first field is empty (those starting with "(,") are not counted.
Any suggestion?
Thanks a lot
-
Re: Possible bug in NULL fields handling
Dmitriy Ryaboy 2009-10-15, 13:09
Currently, COUNT of a bag ignores tuples whose first field is null (this stems from the fact that COUNT of a column counts only non-null values, for SQL compatibility). You may want to try using COUNT_STAR.
This behavior is currently being reconsidered: https://issues.apache.org/jira/browse/PIG-1014 (please provide input!)
-Dmitriy
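For illustration, the suggestion above might be applied to the script from the original message as follows. This is only a sketch: it assumes a Pig version in which COUNT_STAR is available as a built-in, and it changes nothing but the aggregate function.

```pig
-- Same load and grouping as in the original script.
events = LOAD 'events' USING PigStorage(',')
         AS (sessionid:chararray, jobid:chararray, user:chararray);
user_events = GROUP events BY user;

-- COUNT_STAR counts every tuple in the bag, including tuples whose
-- first field is null; COUNT skips those tuples.
event_count_by_user = FOREACH user_events GENERATE group, COUNT_STAR(events);
DUMP event_count_by_user;
```

With the four sample rows, each user should then be reported with a count of one, since every group contains exactly one tuple.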
-
Re: Possible bug in NULL fields handling
Vincent BARAT 2009-10-15, 13:40
Thank you very much for your answer!
I was not aware of the COUNT_STAR() function. I guess it was introduced recently (otherwise it is a bug in the documentation :-)).
Anyway, the final proposal in PIG-1014 seems OK to me. At the least, I think the current behavior of COUNT when applied to bags is misleading.