Hive >> mail # user >> Compression of Intermediate Data

Re: Compression of Intermediate Data
Hi Hadi

The properties you specified don't enable compression of map output. To enable map output compression you need to set the following properties:

SET hive.exec.compress.output=true;

SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
The property 'hive.exec.compress.intermediate' is used to enable compression of the data passed between the multiple MapReduce jobs generated by a Hive query.
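
A minimal sketch of how the two kinds of compression could be set side by side in one Hive session. It assumes the Hadoop 1.x property names (mapred.compress.map.output and mapred.map.output.compression.codec; Hadoop 2 renamed them to mapreduce.map.output.compress and mapreduce.map.output.compress.codec) and that the Snappy native libraries are installed on every node, so treat it as a starting point rather than a tested configuration:

```sql
-- Compress data written between the MapReduce jobs of a single Hive query.
SET hive.exec.compress.intermediate=true;
-- Optional: codec for those intermediate files (assumes Snappy is available).
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress map output inside each job (the map-to-reduce shuffle).
-- Hadoop 1.x names; assumed here, adjust for your Hadoop version.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Separate concern: compress the final output of the query.
SET hive.exec.compress.output=true;
```

Whether this helps depends on the workload: compression trades CPU time for less disk and network I/O, so it is worth timing the same query with these settings on and off.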

Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Hadi Moshayedi <[EMAIL PROTECTED]>
Date: Sat, 6 Oct 2012 16:55:47
Subject: Compression of Intermediate Data

I wanted to look into improving performance of my Hive cluster, and from
what I read turning on compression of intermediate data could help. As I
understand, this would help because it would reduce the amount of data
written to disk in between jobs.

I looked at the documentation and set the following:

SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;

I ran some queries to see how compression affects performance, but it
usually made the query time worse. Even for a query whose intermediate
data was close in size to its input data, compression made performance
worse.

Question 1: Are the above settings the correct ones for enabling
compression of intermediate data?

Question 2: Are there use cases in which compression of intermediate data
would not help performance? Why would someone not always keep it turned on?