Hive >> mail # user >> Compression of Intermediate Data


Hadi Moshayedi 2012-10-06, 13:55
Bejoy KS 2012-10-06, 15:32
Re: Compression of Intermediate Data
Hi Bejoy,
  Thanks.

  Following your instructions, I also enabled map output compression.

  I tried different queries, but I couldn't see any benefit from compression
in any of them. I also tried queries that produce a large amount of
intermediate data, but performance didn't improve for those either.

  I should also note that our Hadoop cluster is set up on a few Amazon EC2
m2.2xlarge instances.

  The question is: in which scenarios can compression improve
performance?
 Thanks,
   -- Hadi

On Sat, Oct 6, 2012 at 6:32 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:

> Hi Hadi
>
> The properties you specified don't enable compression of map output. To
> enable map output compression you need to set the following properties:
>
> SET hive.exec.compress.output=true;
> SET mapred.compress.map.output=true;
> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
>
>
> The property 'hive.exec.compress.intermediate' is used to enable
> compression of the data passed between the multiple MapReduce jobs
> generated by a Hive query.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Hadi Moshayedi <[EMAIL PROTECTED]>
> *Date: *Sat, 6 Oct 2012 16:55:47 +0300
> *To: *<[EMAIL PROTECTED]>
> *ReplyTo: * [EMAIL PROTECTED]
> *Subject: *Compression of Intermediate Data
>
> I wanted to look into improving performance of my Hive cluster, and from
> what I read turning on compression of intermediate data could help. As I
> understand, this would help because it would reduce the amount of data
> written to disk in between jobs.
>
> I looked at the documentation and applied the following settings:
>
> SET hive.exec.compress.intermediate=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
>
>  I ran some queries to see how compression affects performance, but it
> usually made the query time worse. I also had a query whose intermediate
> data was close in size to its input data, but compression made
> performance worse for this query too.
>
>  Question 1: Are the above settings correct for compressing
> intermediate data?
>
>  Question 2: Are there use-cases in which compression of intermediate data
> would not help performance? Why would someone not keep it turned on always?
>
> Thanks
>
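
To pull the settings discussed in this thread into one place, here is a minimal
sketch of a Hive session that turns on both kinds of compression. It assumes
Hadoop 1.x property names (on Hadoop 2.x the map output properties are, as far
as I know, mapreduce.map.output.compress and mapreduce.map.output.compress.codec)
and that the native Snappy libraries are installed on every node. The table and
column names are hypothetical and only there to give the settings a query to
act on.

```sql
-- Compress the data Hive writes between the MapReduce jobs of one query.
SET hive.exec.compress.intermediate=true;

-- Compress map output (the shuffle data) inside each MapReduce job.
-- Assumption: Hadoop 1.x property names.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical query: a join followed by an aggregation usually compiles to
-- more than one job, so both settings above come into play.
SELECT o.customer_id, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY o.customer_id;
```

Running the same query with these properties set to false, and comparing the
job counters for bytes written and shuffled alongside the wall-clock time, is
one way to check whether the saved I/O outweighs the CPU spent compressing on
a given cluster.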