Zebra, RC and Text size comparison
Hi all,

I have around 9 TB of data in Zebra, which I first converted to plain text using TextOutputFormat in M/R; the result was around 43.07 TB. (I think I used no compression here.)

I then converted this data to RC by running the following on the Hive console:

CREATE TABLE LARGERC
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFile
  LOCATION '/user/viraj/huge'
AS SELECT * FROM PLAINTEXT;
(PLAINTEXT is the external table, which is 43.07 TB in size.)

The overall size of the resulting files was around 41.65 TB, so I suspect that compression was not applied.
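
One way to verify is to print the current compression settings on the Hive console (a minimal sketch, assuming that SET with a property name and no value echoes the current setting):

SET hive.exec.compress.output;
SET mapred.output.compress;
SET mapred.output.compression.codec;
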
I read the following documentation:
http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
which says: "The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec".

a) What is the default codec that is being used?

b) Any thoughts on how I can reduce the size?

Viraj
