Has anyone noticed that skew-joins aren't working on Hive 0.11 / Hadoop 0.23?
I've been running the TPC-H benchmarks against Hive 0.11, and none of the queries run to completion when hive.optimize.skewjoin is set to true.
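For anyone trying to reproduce this, the optimization is toggled per session before the query is run. A minimal sketch, assuming everything else is left at its defaults (hive.skewjoin.key is the row-count threshold above which a join key is treated as skewed; the value shown is the documented default and is included only for illustration):
    set hive.optimize.skewjoin=true;
    set hive.skewjoin.key=100000;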
I initially ran into problems like the following:
Ended Job = job_1371646843240_1214
java.io.FileNotFoundException: File hdfs://fstaxxx.yyy.yahoo.com/tmp/hive_2013-07-12_03-22-31_737_6843191588894968654/-mr-10004/hive_skew_join_bigkeys_0 does not exist.
Patching Hive 0.11 with HIVE-4646 resolved that problem.
Now a couple of stages of the query run through successfully, after which I get the following output, and the remaining stages are skipped.
2013-07-12 23:21:02,164 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 15985.47 sec
MapReduce Total cumulative CPU time: 0 days 4 hours 26 minutes 25 seconds 470 msec
Ended Job = job_1371646843240_1295
Stage-10 is filtered out by condition resolver.
MapReduce Jobs Launched:
Job 0: Map: 380 Reduce: 118 Cumulative CPU: 15900.35 sec HDFS Read: 24574270287 HDFS Write: 4925478398 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 25 minutes 0 seconds 350 msec
Time taken: 109.411 seconds
FAILED: SemanticException [Error 10001]: Line 10:5 Table not found 'q16_tmp_cached'
In this particular case, the query is q16_parts_supplier_relationship.hive, part of which looks like:
create table q16_tmp_cached as
select
  p_brand, p_type, p_size, ps_suppkey
from
  partsupp ps join part p
  on
    p.p_partkey = ps.ps_partkey and p.p_brand <> 'Brand#45'
    and not p.p_type like 'MEDIUM POLISHED%'
  join supplier_tmp_cached s
  on
    ps.ps_suppkey = s.s_suppkey;
If I can isolate the problem to a smaller test case, I'll raise a JIRA. I was hoping one of you might have seen this already, or might have a better handle on how skew-joins work in Hive 0.11.