Zebra, RC and Text size comparison
Hi all,

I have some data in Zebra, around 9 TB, which I first converted to plain
text using TextOutputFormat in an M/R job; the result was around 43.07 TB.
(I believe I used no compression here.)

I then converted this data to RC from the Hive console as follows:

 

CREATE TABLE LARGERC
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFile
  LOCATION '/user/viraj/huge'
AS
SELECT * FROM PLAINTEXT;

 

(PLAINTEXT is the external table which is 43.07 TB in size)

 

The overall size of the resulting RCFile data was around 41.65 TB, so I
suspect that compression was not being applied.
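
To check this, I was planning to print the relevant session settings from the
Hive CLI before rerunning the conversion. The property names below are just the
standard Hive/Hadoop compression settings I know of; I am not sure they are all
the ones RCFile actually consults:

-- In the Hive CLI, SET with no value prints the current setting
-- (or reports that the property is undefined).
SET hive.exec.compress.output;
SET mapred.output.compress;
SET mapred.output.compression.codec;
SET mapred.output.compression.type;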

 

I read the RCFile documentation at
http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
which says: "The actual compression algorithm used to compress key and/or
values can be specified by using the appropriate CompressionCodec"
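
My guess, based on that, is that RCFile picks the codec up from the job's
output-compression settings, so I was going to try something like the following
before rerunning the CTAS. The codec below is only an example, and I have not
confirmed that these are the properties RCFile honors in r0.4.0:

-- Assumption: enable compressed output for queries in this session and let
-- RCFile pick up the codec from the usual MapReduce output settings.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
-- GzipCodec ships with Hadoop; any CompressionCodec class should be usable here.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- ...then rerun the CREATE TABLE ... AS SELECT above.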

 

a) What is the default codec that is being used?

b) Any thoughts on how I can reduce the size?

Viraj