I have a table with 10 million rows and 2 columns - id (int) and element (string). I am trying to do a self join that finds any ids where the element values are the same, and my query looks like:
select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
I tested this at a smaller scale and it works well. The problem is that with 10 million rows, this becomes a bit large and I've let it run for 90 minutes and it was up to 80GB of disk space and still going. The original input data was only 500MB.
Is this something I can optimize in hive? Or should I be considering a different approach to the problem instead?