I have a table with 10 million rows and 2 columns - id (int) and element (string). I am trying to do a self join that finds any ids where the element values are the same, and my query looks like:
SELECT e1.id, e1.element, e2.id AS id2, e2.element AS element2 FROM elements e1 JOIN elements e2 ON e1.element = e2.element WHERE e1.id < e2.id;
I tested this at a smaller scale and it works well. With 10 million rows, however, the job gets very large: I let it run for 90 minutes, by which point it had used 80GB of disk space and was still going. The original input data was only 500MB.
Is this something I can optimize in hive? Or should I be considering a different approach to the problem instead?
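For what it's worth, the output size of a join like this can be estimated before running it: an element value shared by c ids contributes c * (c - 1) / 2 pairs, so a handful of very common values can account for most of the blow-up. Below is a minimal sketch of that check. It uses Python's built-in sqlite3 module as a stand-in for Hive, assumes ids are unique, and the helper name pair_counts and the sample rows are made up for illustration. The aggregate query is plain GROUP BY / COUNT, so it should carry over to Hive, but I have only run it against SQLite.

```python
import sqlite3


def pair_counts(rows):
    """Return (actual, predicted) pair counts for the self-join.

    rows is a list of (id, element) tuples; ids are assumed unique.
    SQLite is used here only as a small stand-in for Hive.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements (id INTEGER, element TEXT)")
    conn.executemany("INSERT INTO elements VALUES (?, ?)", rows)

    # Number of rows the self-join from the question would emit.
    actual = conn.execute(
        "SELECT COUNT(*) FROM elements e1 JOIN elements e2 "
        "ON e1.element = e2.element WHERE e1.id < e2.id"
    ).fetchone()[0]

    # Same number, predicted from per-element counts without joining:
    # a value shared by c ids yields c * (c - 1) / 2 pairs.
    predicted = conn.execute(
        "SELECT COALESCE(SUM(c * (c - 1) / 2), 0) FROM "
        "(SELECT element, COUNT(*) AS c FROM elements GROUP BY element) AS t"
    ).fetchone()[0]

    conn.close()
    return actual, predicted


if __name__ == "__main__":
    sample = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b"), (6, "c")]
    print(pair_counts(sample))
```

If the predicted number comes out in the billions for the real table, the disk usage is the size of the result itself rather than something a join hint is likely to fix.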
So that's your final assessment, eh? :) What is your comment about the outer query _joining on value_ to get the key?
On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey <[EMAIL PROTECTED]> wrote: