Compression of Intermediate Data
Hadi Moshayedi 2012-10-06, 13:55
I wanted to look into improving the performance of my Hive cluster, and from what I read, turning on compression of intermediate data could help. As I understand it, this would help because it would reduce the amount of data written to disk in between jobs.
I looked at the documentation and set the following:
SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
I ran some queries to see how compression affects performance, but it usually made the query time worse. I also had a query whose intermediate data was close to the size of its input data, but compression made performance worse for that query too.
Question 1: Are the above settings the correct ones for compressing intermediate data?
Question 2: Are there use-cases in which compression of intermediate data would not help performance? Why would someone not keep it turned on always?
Thanks
Re: Compression of Intermediate Data
Bejoy KS 2012-10-06, 15:32
Hi Hadi
The properties you specified don't enable compression of map output. To enable map output compression, you need to set the following properties:
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
The property 'hive.exec.compress.intermediate' is used to enable compression of the data passed between the multiple MapReduce jobs generated by a Hive query.
Regards Bejoy KS
Sent from handheld, please excuse typos.
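The settings discussed in this reply can be grouped by the stage they affect. The following is only a sketch, assuming Hadoop 1.x property names and that the Snappy codec is installed on every node; the comments reflect the usual meaning of each property and are not taken from the thread itself.

```sql
-- Compress the files Hive writes between the MapReduce jobs of a single query.
SET hive.exec.compress.intermediate=true;

-- Compress map output (the data shuffled to reducers) inside each job.
-- Assumption: Hadoop 1.x property names.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output of the query; BLOCK applies to SequenceFile output.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
```

Running `SET hive.exec.compress.intermediate;` with no value prints the current setting, which is a quick way to confirm what a session is actually using.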
Re: Compression of Intermediate Data
Hadi Moshayedi 2012-10-08, 15:17
Hi Bejoy, Thanks.
Following your instructions, I also enabled map output compression.
I tried different queries, but I couldn't get any benefit from compression in any of them. I also tried creating queries that have large intermediate data, but compression didn't improve the performance of those either.
I should also note that our Hadoop cluster is set up on a few Amazon EC2 m2.2xlarge instances.
The question is: in what scenarios can compression improve performance?
Thanks,
-- Hadi
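One way to make the comparison described above repeatable is to run the same multi-job query twice in one session, toggling only the compression settings. This is a sketch under stated assumptions: the tables `page_views` and `users` and their columns are hypothetical, and the property names are the Hadoop 1.x ones.

```sql
-- Run 1: compression off (baseline).
SET hive.exec.compress.intermediate=false;
SET mapred.compress.map.output=false;

-- Hypothetical query; a join followed by GROUP BY and ORDER BY
-- typically compiles to more than one MapReduce job.
SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.country
ORDER BY views DESC;

-- Run 2: compression on, same query.
SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.country
ORDER BY views DESC;
```

Comparing the per-job counters for the two runs (bytes written, shuffle bytes, CPU time) alongside wall-clock time shows whether the I/O saved outweighs the CPU spent on compression.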