Dragos Munteanu 2011-02-18, 20:26
Thejas M Nair 2011-02-18, 21:08
Dragos Munteanu 2011-02-22, 23:53
Can you please open a JIRA with this information? https://issues.apache.org/jira/browse/PIG
If you are able to create a sample script/data that can reproduce this issue, that will also be very useful.
As a workaround, you can probably split the query into independent queries, each with a smaller number of group-by-and-sum operations.
On 2/22/11 3:53 PM, "Dragos Munteanu" <[EMAIL PROTECTED]> wrote:
I tried the patch you mentioned for that issue. It helps a little, but as I try more complex scripts, the multiquery failures come back. Details below.
I'm running pig compiled from http://svn.apache.org/repos/asf/pig/branches/branch-0.8
checked out on Feb. 18, compiled with jdk1.6.0_24
My script does the following:
- read from disk a relation where each tuple has 10 fields, one of which is a count
- take each non-count field in turn, group by it, and sum the counts for each group
Initially my script computed 5 such group-by-and-sum, which failed on the non-patched pig-0.8.
With the patch, this script worked just fine.
I then ran a script that does 15 group-by-and-sums (grouping also by pairs of fields). In this run, a couple of reducer attempts failed (Map output copy failure: java.lang.OutOfMemoryError: Java heap space), but the job as a whole succeeded.
I then ran a script that also does 15 group-bys, but for each group it performs a more complex computation (I provide a code example below). This time the job fails, and quickly. Just like above, a bunch of reducers fail with the "Java heap space" error; and the log of the entire job says:
JobId Alias Feature Message Outputs
Failed to read data from "hdfs://pig1/user/dmunteanu/RuleProcess.xTCxi/rules"
I'm not sure why it complains about "failed to read data"; my best guess is that it's because the job fails before all mappers have even run.
The exact same script runs just fine with "no_multiquery", so the problem has to come from the multiquery optimization.
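For reference, this is how multiquery optimization gets disabled from the command line in Pig 0.8 (the script filename here is just a placeholder):

```shell
# Run the script with multiquery optimization turned off;
# -no_multiquery (short form: -M) is the Pig 0.8 switch for this.
pig -no_multiquery myscript.pig
```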
Below is a sample of my script.
Basically, it groups by something and then:
* sums up all the counts for the members of the group
* computes, for all members of the group, counts-of-counts (i.e. how many tuples in the group have the same count as the current tuple)
The example shows the computation for one group; this code is then repeated (with different relation names) for the other groupings.
-- compute totals
statT_rules = FOREACH merged_rules GENERATE root, count;
statT_rules_grouped = GROUP statT_rules BY root PARALLEL 30;
statT_totals = FOREACH statT_rules_grouped GENERATE FLATTEN(group), SUM(statT_rules.count) AS total;
statT_tcounts = FOREACH statT_rules GENERATE root, count, (count >= 5 ? 5 : count) as tcount;
statT_tcounts_grouped = GROUP statT_tcounts BY (root,tcount) PARALLEL 30;
statT_ccounts = FOREACH statT_tcounts_grouped GENERATE FLATTEN(group), COUNT(statT_tcounts) AS ccount;
statT_joined = JOIN statT_totals BY group, statT_ccounts BY root;
statT_joined_filtered = FOREACH statT_joined GENERATE statT_totals::group AS root, statT_totals::total AS total, statT_ccounts::group::tcount AS tcount, statT_ccounts::ccount AS ccount;
statT_joined_grouped = GROUP statT_joined_filtered BY (root,total) PARALLEL 30;
statT_joined_print = FOREACH statT_joined_grouped GENERATE FLATTEN(group), statT_joined_filtered.(tcount,ccount);
STORE statT_joined_print INTO 'RuleProcess.xTCxi.2/stats.root' using PigStorage;
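To make the intent of the Pig script above concrete, here is a plain-Python sketch of the same totals and counts-of-counts logic. The variable names mirror the Pig aliases, but the sample data is made up for illustration:

```python
# Sketch of the group/total/counts-of-counts computation from the Pig script.
# Each input tuple is (root, count); the sample data below is hypothetical.
from collections import Counter, defaultdict

rules = [("a", 1), ("a", 1), ("a", 7), ("b", 2)]

# statT_totals: sum of counts per root
totals = defaultdict(int)
for root, count in rules:
    totals[root] += count

# statT_ccounts: per (root, truncated count), how many tuples share that count;
# min(count, 5) mirrors the Pig bincond (count >= 5 ? 5 : count)
ccounts = Counter((root, min(count, 5)) for root, count in rules)

# statT_joined_print: each root with its total and its (tcount, ccount) pairs
stats = {
    root: (total, sorted((t, c) for (r, t), c in ccounts.items() if r == root))
    for root, total in totals.items()
}
print(stats)
```

This is only meant to show what the script computes per group, not how Pig executes it; the multiquery failure presumably comes from running many such pipelines in one job.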
On 2/18/11 1:08 PM, "Thejas M Nair" <[EMAIL PROTECTED]> wrote: