Skew Joins borked on Hive11 (Hadoop23)?
Hello, all.

Has anyone noticed that skew-joins aren't working on Hive 0.11 / Hadoop 0.23?

I've been running the TPC-H benchmarks against Hive 0.11, and none of the queries run to completion when hive.optimize.skewjoin is set to true.
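
For reference, a minimal sketch of toggling the setting per session from the Hive CLI. The hive.skewjoin.key line is an assumption added for illustration only: 100000 is its documented default, and that property is not otherwise part of this report.

<quote>
-- Enable the runtime skew-join optimization for this session.
set hive.optimize.skewjoin=true;
-- Assumed for illustration: row-count threshold above which a join key is treated as skewed (documented default).
set hive.skewjoin.key=100000;
-- Setting the flag back to false is the obvious way to compare behaviour on the same query.
set hive.optimize.skewjoin=false;
</quote>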

I initially ran into problems like the following:

<quote>
Ended Job = job_1371646843240_1214
java.io.FileNotFoundException: File hdfs://fstaxxx.yyy.yahoo.com/tmp/hive_2013-07-12_03-22-31_737_6843191588894968654/-mr-10004/hive_skew_join_bigkeys_0 does not exist.
</quote> 

Patching Hive 0.11 with HIVE-4646 resolved that problem.

What I see now is that the first couple of stages of a query complete successfully, after which I get the following output and the remaining stages are skipped.

<quote>
2013-07-12 23:21:02,164 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 15985.47 sec
MapReduce Total cumulative CPU time: 0 days 4 hours 26 minutes 25 seconds 470 msec
Ended Job = job_1371646843240_1295
Stage-10 is filtered out by condition resolver.
MapReduce Jobs Launched:
Job 0: Map: 380  Reduce: 118   Cumulative CPU: 15900.35 sec   HDFS Read: 24574270287 HDFS Write: 4925478398 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 25 minutes 0 seconds 350 msec
OK
Time taken: 109.411 seconds
FAILED: SemanticException [Error 10001]: Line 10:5 Table not found 'q16_tmp_cached'
</quote>

In this particular case, the query is q16_parts_supplier_relationship.hive, part of which looks like:

<quote>
create table q16_tmp_cached as
select
  p_brand, p_type, p_size, ps_suppkey
from
  partsupp ps join part p
  on
    p.p_partkey = ps.ps_partkey and p.p_brand <> 'Brand#45'
    and not p.p_type like 'MEDIUM POLISHED%'
  join supplier_tmp_cached s
  on
    ps.ps_suppkey = s.s_suppkey;
</quote>

If I can isolate the problem to a smaller test case, I'll raise a JIRA. I was hoping one of you might have seen this already, or might have a better handle on how skew joins work in Hive 0.11.

Many thanks,
Mithun