Not really. If it was a single column with no counters, Hive provides an option to use percentile. So basically if the data was like -
100 100 200 200 200 200 300
But if we have 2 columns, one that maintain the value and the other that maintains the count, how can Hive be used to derive the percentile?
Value Count 100 2 200 4 300 1
From: Stephen Sprague <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Reply-To: "[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: Thursday, March 20, 2014 5:28 AM To: "[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Subject: Re: computing median and percentiles
not a hive question is it? its more like a math question.
On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: I understand the percentile function is supported in Hive in the latest versions. However, how does once calculate percentiles when the data is across two columns. So say -
100 2 ( so basically 100 occurred twice) 200 4 300 1 400 6 500 3 I want to find out the 0.25 percentile for the value distribution. How can I do it using the Hive percentile function?
off the top of my head i'm not sure how easy this would be in SQL but i imagine using ROW_NUMBER() function a clever person could do it. And if not SQL then perhaps a custom UDF but ultimately you have to do the work and now you have a potential recipe to follow. :) On Wed, Mar 19, 2014 at 9:37 PM, Seema Datar <[EMAIL PROTECTED]> wrote:
NEW: Monitor These Apps!
Apache Lucene, Apache Solr and all other Apache Software Foundation projects and their respective logos are trademarks of the Apache Software Foundation.
Elasticsearch, Kibana, Logstash, and Beats are trademarks of Elasticsearch BV, registered in the U.S. and in other countries. This site and Sematext Group is in no way affiliated with Elasticsearch BV.
Service operated by Sematext