Hive >> mail # user >> Compression of Intermediate Data


Re: Compression of Intermediate Data
Hi Hadi

The properties you specified don't enable compression of map output. To enable map output compression, set the following properties:

SET hive.exec.compress.output=true;

SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
The property 'hive.exec.compress.intermediate' is used to enable compression of the data passed between the multiple MapReduce jobs generated by a Hive query.
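
To keep the two levels apart, here is a minimal sketch of how they might be set together. It assumes the Hadoop 1.x property name mapred.compress.map.output (on Hadoop 2.x the equivalents should be mapreduce.map.output.compress and mapreduce.map.output.compress.codec) and that the Snappy native libraries are installed on every node. Please check the names against your own version before relying on them.

```sql
-- Level 1: compress the map output (shuffle data) inside each MapReduce job.
-- Assumption: Hadoop 1.x property names, Snappy native libraries available.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Level 2: compress the files Hive writes between the jobs of a multi-job query.
SET hive.exec.compress.intermediate=true;
```

The first level only affects data moved from mappers to reducers within one job. The second only matters when a query is compiled into more than one job.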

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Hadi Moshayedi <[EMAIL PROTECTED]>
Date: Sat, 6 Oct 2012 16:55:47
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Compression of Intermediate Data

I wanted to look into improving performance of my Hive cluster, and from
what I read turning on compression of intermediate data could help. As I
understand, this would help because it would reduce the amount of data
written to disk in between jobs.

I looked at the documentation and set the following properties:

SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

I ran some queries to see how compression affects performance, but it
usually made the query time worse. One query had intermediate data close
in size to its input data, yet compression made that query slower too.

Question 1: Are the above settings the correct ones for compressing
intermediate data?

 Question 2: Are there use-cases in which compression of intermediate data
would not help performance? Why would someone not keep it turned on always?

Thanks
