-
Bug in nested foreach with ORDER after grouping with multiple keys
Michael Dalton 2010-04-07, 07:08
Hi,

I've hit a somewhat obscure bug in the scripts I'm writing, caused by the combination of a few factors: multiple column groups, PARALLEL > 1 for grouping, and a nested for-each body following the group that sorts using ORDER. Removing any of these factors (e.g. setting PARALLEL to 1, or changing ORDER to a dummy FILTER command) causes the bug to disappear. The end result is that the final GROUP/ORDER occurs with the incorrect group key, causing incorrect output.

I have a tiny input file that generates this behavior: http://pastebin.com/UQZkug8Y

Here is a script showing the behavior in question:

log = load '/tmp/breakme.txt' USING PigStorage(':') AS (userid:int, email:chararray, subject:chararray, msgid:long);
group_email = GROUP log BY (userid, email) PARALLEL 10;
email_count = FOREACH group_email GENERATE group.userid, COUNT(log) AS count, group.email;
group_user = GROUP email_count BY userid PARALLEL 10;
top_for_user = FOREACH group_user {
    sorted_count = ORDER email_count BY count DESC;
    GENERATE group, sorted_count;
}
DUMP top_for_user;

The expected output is that each (userid, sorted_list) pair occurs once, with the list sorted in descending order by count. Instead, many (userid, partial_fragment_of_sorted_list) pairs appear for the same userid. Interestingly enough, each one of the 'count' fields is correct. If I had to hazard a guess, perhaps the composite key (userid, email) from the first GROUP operation is being re-used, or multiple operations are being pushed into the same reducer despite requiring a different ordering/grouping.

Here is the (incorrect) output from the above script:

(100,{(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED])})
(100,{(100,2L,[EMAIL PROTECTED])})
(101,{(101,1L,[EMAIL PROTECTED]),(101,1L,jakaslf@jlkasfds@.com),(101,1L,[EMAIL PROTECTED])})

Note how there are two entries for userid 100, which should be impossible. Here is the output if I change GROUP email_count BY userid PARALLEL 10 to use PARALLEL 1 instead. This produces the correct/expected result:

(100,{(100,2L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED])})
(101,{(101,1L,[EMAIL PROTECTED]),(101,1L,jakaslf@jlkasfds@.com),(101,1L,[EMAIL PROTECTED])})

Let me know if there's anything I can do to further help/fix this issue.

Best regards,
Mike
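For reference, the semantics the script is expected to have can be modeled in a few lines of plain Python. This is only a sketch of the intended result, not of how Pig executes the plan; the rows and email addresses below are hypothetical placeholders and are not the pastebin data.

```python
from collections import Counter, defaultdict

# Hypothetical rows in the (userid, email, subject, msgid) layout of the
# LOAD statement. These are made-up values, not the pastebin input.
rows = [
    (100, "a@example.com", "hi", 1),
    (100, "a@example.com", "re: hi", 2),
    (100, "b@example.com", "lunch", 3),
    (101, "c@example.com", "status", 4),
]


def top_for_user(rows):
    # GROUP log BY (userid, email), then COUNT(log) per group.
    counts = Counter((userid, email) for userid, email, _subject, _msgid in rows)

    # GROUP email_count BY userid: every userid must end up in exactly one bag.
    by_user = defaultdict(list)
    for (userid, email), count in counts.items():
        by_user[userid].append((userid, count, email))

    # Nested ORDER email_count BY count DESC inside each bag.
    return {
        userid: sorted(tuples, key=lambda t: t[1], reverse=True)
        for userid, tuples in by_user.items()
    }


result = top_for_user(rows)
for userid, bag in result.items():
    print(userid, bag)
```

Because the result is keyed by userid, a second entry for the same userid cannot occur in this model, which is the invariant the incorrect output violates.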
-
Re: Bug in nested foreach with ORDER after grouping with multiple keys
Michael Dalton 2010-04-07, 07:45
I have identified the source of the bug: the secondary key optimizations introduced in PIG-1038. If you run Pig with -Dpig.exec.nosecondarykey=true then you get the correct result. I will try to get a patch together.

Best regards,
Mike
-
Re: Bug in nested foreach with ORDER after grouping with multiple keys
Michael Dalton 2010-04-07, 12:19
I can confirm that somehow the Partitioner isn't being respected -- SecondaryKeyPartitioner is ignored. This is due to https://issues.apache.org/jira/browse/MAPREDUCE-565. This is not a bug in Pig, it (was) an issue in Hadoop. I just need to upgrade Hadoop to resolve MAPREDUCE-565.

Best regards,
Mike
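The effect of an ignored partitioner can be reproduced with a toy model. The sketch below assumes the usual secondary-sort arrangement, in which the map output key is a composite of the group key and the sort key, and a custom partitioner is supposed to route records by the group key alone. The two partition functions and their arithmetic are invented for illustration; they are not Pig's SecondaryKeyPartitioner or Hadoop's default partitioner.

```python
from collections import defaultdict

NUM_REDUCERS = 10  # mirrors PARALLEL 10 in the script

# (userid, count) pairs shaped like the thread's output: userid 100 has six
# senders with count 1 and one with count 2; userid 101 has three with count 1.
records = [(100, 1)] * 6 + [(100, 2)] + [(101, 1)] * 3


def full_key_partition(key, n):
    """Route on the whole composite key (illustrative stand-in for what
    happens when the custom partitioner is ignored)."""
    userid, count = key
    return (31 * userid + count) % n


def primary_key_partition(key, n):
    """Route on the group key only (illustrative stand-in for a
    secondary-key-aware partitioner)."""
    userid, _count = key
    return userid % n


def run_reducers(records, partition, n=NUM_REDUCERS):
    # Shuffle: assign every composite key to a reducer.
    reducers = defaultdict(list)
    for key in records:
        reducers[partition(key, n)].append(key)

    # Reduce: each reducer sorts by (userid, count DESC) and emits one bag
    # per userid that it happens to see.
    output = []
    for reducer_id in sorted(reducers):
        bags = defaultdict(list)
        for userid, count in sorted(reducers[reducer_id], key=lambda k: (k[0], -k[1])):
            bags[userid].append(count)
        output.extend(sorted(bags.items()))
    return output


broken = run_reducers(records, full_key_partition)
fixed = run_reducers(records, primary_key_partition)
print("full-key partitioning:   ", broken)
print("primary-key partitioning:", fixed)
```

With full-key partitioning, the (100, 2) record lands on a different reducer from the (100, 1) records, so userid 100 is emitted twice with correct counts in each fragment, which is the same shape as the incorrect output reported above. It also suggests why PARALLEL 1 hides the problem: with a single reducer there is nothing to mis-route.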
-
Re: Bug in nested foreach with ORDER after grouping with multiple keys
Ashutosh Chauhan 2010-04-07, 16:06
Hi Mike,

Glad that you debugged the issue. Once you try it out on an upgraded Hadoop version, can you let us know whether that resolved your problem? It seems the issue occurs on Hadoop 0.20 and is fixed in Hadoop 0.20.1.

Ashutosh
-
Re: Bug in nested foreach with ORDER after grouping with multiple keys
Michael Dalton 2010-04-08, 09:56
Thanks Ashutosh,

I can confirm this issue was resolved by upgrading to the latest stable Hadoop build, 0.20.2. The cause was definitely MAPREDUCE-565.

Best regards,
Mike