I have around 9 TB of data in Zebra format, which I first converted to
plain text using TextOutputFormat in a MapReduce job; the result was
around 43.07 TB. [[I think I used no compression here.]]
I then converted this data to RCFile from the Hive console with:
CREATE TABLE LARGERC
ROW FORMAT SERDE
STORED AS RCFile
LOCATION '/user/viraj/huge' AS
SELECT * FROM PLAINTEXT;
(PLAINTEXT is the external table, 43.07 TB in size.)
The resulting files total around 41.65 TB, so I suspect that
compression was not actually being applied.
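For reference, these are the session settings I understand should enable block compression on the RCFile output (property names taken from the Hadoop/Hive compression docs; GzipCodec is just an example codec, not necessarily the one I should use):

```sql
-- Enable compressed output for queries that write table data
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
-- Compress whole blocks rather than individual records
SET mapred.output.compression.type=BLOCK;
-- Pick an explicit codec (example: gzip); otherwise Hadoop's default codec is used
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE LARGERC_COMPRESSED
STORED AS RCFile AS
SELECT * FROM PLAINTEXT;
```

Should I have set these before running the CREATE TABLE ... AS SELECT above?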
I read the documentation at io/RCFile.html, which says: "The actual
compression algorithm used to compress key and/or values can be
specified by using the appropriate" ...
a) What is the default Codec that is being used?
b) Any thoughts on how I can reduce the size?