Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Need help with percentile calculation


Copy link to this message
-
Need help with percentile calculation
Hi,

I am trying to use percentile function of HIVE but getting exception from Amazon EMR service.
I am using version 0.7.

Please assist. It is very critical and urgent.

Below is the code snippet:

CREATE EXTERNAL TABLE IF NOT EXISTS server_d
(
  ag_date STRING,
  median _time BIGINT,
  95percentile _time BIGINT
) COMMENT 'server'
PARTITIONED by (dt STRING, hh STRING, min STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${hiveconf:S3_INPUT_BUCKET}/server /';

INSERT OVERWRITE TABLE server_d PARTITION(dt='${hiveconf:DT}', hh='${hiveconf:HH}', min='${hiveconf:MIN}')
SELECT '${hiveconf:DT}' as ag_date,
percentile(total _time, 0.50) as median _time,
percentile(total_time, 0.95) as 95percentile_time
from my_log my
where my.date_req = '${hiveconf:DT}' and my.dt = '${hiveconf:DT}';
exception that I am getting:

Exception in thread "Thread-239" java.lang.RuntimeException: Error while reading from task log url
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
        at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
        at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.***.***.**:9103/tasklog?taskid=attempt_201****0835_0005_m_000001_3&start=-8193
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
        at java.net.URL.openStream(URL.java:1010)
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
        ... 3 more

Thanks,
Puneet

From: Felix.徐 [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 21, 2012 2:22 PM
To: [EMAIL PROTECTED]
Subject: Bugs exist in SEMI JOIN?

Hi,
I am using the version 0.9.0 and my tables are the same with TPC-H benchmark:

Here is a simple query(works correctly):

Q1
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN(
  SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')
 ) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

it means inserting the key of customers who has orders since 1995-12-31 into another table.
But if I write the query like this:

Q2
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 AND unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')

I will get exception from Hive:
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
          at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
          at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
          at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
          at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
          at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
          at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
          at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
          at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
          at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Also,If I write the query like this:
Q3
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 WHERE unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')

Then this query can be executed(wondering the right hand of SEMI JOIN can be referenced in WHERE clause now?), but the result is wrong(comparing to Q1, Q1's result is the same with mysql).
Any comments or statements made in this email are not necessarily those of Tavant Technologies.
The information transmitted is intended only for the person or entity to which it is addressed and may
contain confidential and/or privileged material. If you have received this in error, please contact the
sender and delete the material from any computer. All e-mails sent from or to Tavant Technologies
may be subject to our monitoring procedures.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB