

Re: Bug in nested foreach with ORDER after grouping with multiple keys
Hi Mike,

Glad that you debugged the issue. Once you try it out on the upgraded
Hadoop version, can you let us know whether that resolved your problem?
The issue seems to occur on Hadoop 0.20 and to be fixed in a later Hadoop release.


On Wed, Apr 7, 2010 at 05:19, Michael Dalton <[EMAIL PROTECTED]> wrote:
> I can confirm that somehow the Partitioner isn't being respected --
> SecondaryKeyPartitioner is ignored. This is due to
> https://issues.apache.org/jira/browse/MAPREDUCE-565. This is not a bug in
> Pig; it was an issue in Hadoop. I just need to upgrade Hadoop to resolve it.
> Best regards,
> Mike
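To illustrate why an ignored custom Partitioner produces the symptoms below, here is a minimal toy sketch (not Pig's or Hadoop's actual code; the partition functions and record values are made-up stand-ins). The secondary-key plan wants to route tuples by the group key (userid) alone, so the whole group meets on one reducer; if the default partitioner hashes the full composite key (userid, email) instead, tuples for one userid scatter across reducers, and each reducer emits its own partial group:

```python
# Toy model: records keyed by (userid, email), 10 reducers (PARALLEL 10).
# Both partition functions below are illustrative stand-ins, not Hadoop/Pig code.
RECORDS = [(100, "aa@x.com"), (100, "bbbb@y.com"), (100, "c@z.com")]
REDUCERS = 10

def custom_partition(key, n):
    # What a secondary-key partitioner intends: route by the group key only.
    userid, _email = key
    return userid % n

def default_partition(key, n):
    # What happens when the custom partitioner is ignored: a hash over the
    # whole composite key (deterministic toy hash for the sketch).
    userid, email = key
    return (userid + len(email)) % n

def reducers_hit(partition_fn):
    """Set of reducers that receive at least one userid-100 tuple."""
    return {partition_fn(k, REDUCERS) for k in RECORDS}

# Intended partitioner: all userid-100 tuples meet on a single reducer,
# so the nested ORDER sees the complete bag for the group.
assert len(reducers_hit(custom_partition)) == 1

# Ignored partitioner: the same userid is scattered, so each reducer
# outputs its own partial (userid, fragment_of_list) group -- matching
# the duplicate userid-100 rows in the output below.
assert len(reducers_hit(default_partition)) > 1
```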
> On Wed, Apr 7, 2010 at 12:45 AM, Michael Dalton <[EMAIL PROTECTED]> wrote:
>> I have identified the source of the bug: the secondary key optimizations
>> introduced in PIG-1038. If you run Pig with -Dpig.exec.nosecondarykey=true
>> then you get the correct result. I will try to get a patch together.
>> Best regards,
>> Mike
>> On Wed, Apr 7, 2010 at 12:08 AM, Michael Dalton <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>> I've hit a somewhat obscure bug in the scripts I'm writing caused by the
>>> combination of a few factors: multiple column groups, PARALLEL > 1 for
>>> grouping, and a nested for-each body following the group that sorts using
>>> ORDER. Removing any of these factors (e.g. setting PARALLEL to 1, changing
>>> ORDER to a dummy FILTER command, etc) causes the bug to disappear. The end
>>> result is that the final GROUP/ORDER occurs with the incorrect group key,
>>> causing incorrect output.
>>> I have a tiny input file that generates this behavior:
>>> http://pastebin.com/UQZkug8Y
>>> Here is a script showing the behavior in question:
>>>   log = load '/tmp/breakme.txt' USING PigStorage(':') AS (userid:int,
>>> email:chararray, subject:chararray, msgid:long);
>>>   group_email = GROUP log BY (userid, email) PARALLEL 10;
>>>   email_count = FOREACH group_email GENERATE group.userid, COUNT(log) AS
>>> count, group.email;
>>>   group_user = GROUP email_count BY userid PARALLEL 10;
>>>   top_for_user = FOREACH group_user {
>>>     sorted_count = ORDER email_count BY count DESC;
>>>     GENERATE group, sorted_count;
>>>   };
>>>   DUMP top_for_user;
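The logic of the script above can be sketched in plain Python (the pastebin input isn't reproduced here, so the sample rows below are hypothetical stand-ins in the same userid:email:subject:msgid shape):

```python
from collections import defaultdict

# Hypothetical sample rows; the real /tmp/breakme.txt from the pastebin
# link is not reproduced here.
log = [
    (100, "a@x.com", "hi", 1),
    (100, "a@x.com", "re", 2),
    (100, "b@y.com", "yo", 3),
    (101, "c@z.com", "ok", 4),
]

# GROUP log BY (userid, email), then COUNT(log) per group.
counts = defaultdict(int)
for userid, email, _subject, _msgid in log:
    counts[(userid, email)] += 1

# GROUP email_count BY userid: collect (userid, count, email) per user.
per_user = defaultdict(list)
for (userid, email), n in counts.items():
    per_user[userid].append((userid, n, email))

# Nested FOREACH: ORDER each user's bag BY count DESC.
top_for_user = {
    userid: sorted(bag, key=lambda t: t[1], reverse=True)
    for userid, bag in per_user.items()
}

for userid, bag in sorted(top_for_user.items()):
    print(userid, bag)
```

This mirrors the expected behavior described below: exactly one (userid, sorted_list) pair per userid.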
>>> The expected output is that each (userid, sorted_list) pair occurs
>>> exactly once, with the list sorted in descending order by count.
>>> However, instead many (userid, partial_fragment_of_sorted_list) pairs appear
>>> for the same userid. Interestingly enough, each one of the 'count' fields is
>>> correct. If I had to hazard a guess, perhaps the composite key (userid,
>>> email) from the first GROUP operation is being re-used or multiple
>>> operations are being pushed into the same reducer despite requiring a
>>> different ordering/grouping.
>>> Here is the (incorrect) output from the above script:
>>>  (100,{(100,1L,[EMAIL PROTECTED]),(100,1L,[EMAIL PROTECTED]),(100,1L,
>>> ),(100,1L,[EMAIL PROTECTED])})
>>> (100,{(100,2L,[EMAIL PROTECTED])})
>>> (101,{(101,1L,[EMAIL PROTECTED]),(101,1L,jakaslf@jlkasfds@.com),(101,1L,
>>> Note how there are two entries for userid 100, which should be
>>> impossible. Here is the output if I change GROUP email_count BY userid
>>> PARALLEL 10 to use PARALLEL 1 instead. This produces the correct/expected
>>> result:
>>> ),(100,1L,[EMAIL PROTECTED])})
>>> (101,{(101,1L,[EMAIL PROTECTED]),(101,1L,jakaslf@jlkasfds@.com),(101,1L,
>>> Let me know if there's anything I can do to further help/fix this issue.
>>> Best regards,
>>> Mike