I want to improve the performance of my Hive cluster, and from what I
have read, turning on compression of intermediate data could help. As I
understand it, this helps because it reduces the amount of data written
to disk between jobs.
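
The trade-off behind this reasoning can be sketched outside Hive. The snippet below is only an illustration: it uses Python's `zlib` rather than any Hive codec, and the helper `measure_compression` and the sample rows are hypothetical, not taken from my cluster. It shows that compression writes fewer bytes but costs CPU time to do so.

```python
import time
import zlib


def measure_compression(payload: bytes, level: int = 6) -> dict:
    """Compress payload once and report sizes plus the time spent compressing."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(compressed) / len(payload),
        "compress_seconds": elapsed,
    }


# Hypothetical sample: repetitive, text-like rows standing in for intermediate data.
rows = b"".join(b"user_%d,click,2020-01-01\n" % (i % 100) for i in range(50_000))

stats = measure_compression(rows)
print(stats)
```

Whether the smaller output is worth the extra CPU time depends on the data and on how fast the disks and network are, so the numbers from a sketch like this say nothing definite about a real cluster.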
I looked at the documentation and set the following settings:
I ran some queries to see how compression affects performance, but it
usually made the query time worse. Even a query whose intermediate data
was close in size to its input data ran slower with compression
enabled.
Question 1: Are the settings above the correct ones for compressing
intermediate data?
Question 2: Are there use cases in which compressing intermediate data
would not help performance? Why would someone not always keep it turned on?