Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Bugs exist in SEMI JOIN?


Copy link to this message
-
Bugs exist in SEMI JOIN?
Hi,
I am using the version 0.9.0 and my tables are the same with TPC-H
benchmark:

Here is a simple query(works correctly):

*Q1*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN(
  SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE,
'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')
 ) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

it means inserting the key of customers who has orders since 1995-12-31
into another table.
But if I write the query like this:

*Q2*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *AND *unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

I will get exception from Hive:
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Also,If I write the query like this:
*Q3*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *WHERE *unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

Then this query can be executed(wondering the right hand of SEMI JOIN can
be referenced in WHERE clause now?), but the result is wrong(comparing to *Q1,
Q1*'s result is the same with mysql).
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB