SequenceFile compression on Amazon EMR not very good
Hi,

My gzipped weblog files are about 35MB. However, after enabling block
compression and inserting the logs into another Hive table (stored as a
SequenceFile), the file size bloats to about 233MB. I've done similar
processing on a local Hadoop/Hive cluster, and while the compression there is
not as good as gzip, it is not this bad. What could be going wrong?

I looked at the header of the resulting file and here's what it says:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
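
For anyone who wants to read that header field by field, here is a minimal Python sketch. It assumes the version 6 SequenceFile layout (magic "SEQ", a version byte, the key and value class names as length-prefixed strings, one byte for the compression flag, one byte for the block-compression flag, then the codec class name). The function names and the `sample` bytes are my own: `sample` is a reconstruction of the caret notation above (^F = 0x06, ^Y = 0x19, ^A = 0x01, ^@ = 0x00), not bytes read from the actual file.

```python
def _read_string(buf: bytes, pos: int):
    """Read a length-prefixed UTF-8 string and return (text, next_position).

    Assumption: the length fits in a single byte (< 128), which holds for
    ordinary Java class names. Longer multi-byte lengths are not handled.
    """
    length = buf[pos]
    if length > 127:
        raise ValueError("multi-byte length prefixes are not handled in this sketch")
    start = pos + 1
    end = start + length
    return buf[start:end].decode("utf-8"), end


def parse_seqfile_header(buf: bytes) -> dict:
    """Decode the leading fields of a version 6 SequenceFile header."""
    if buf[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = buf[3]
    pos = 4
    key_class, pos = _read_string(buf, pos)
    value_class, pos = _read_string(buf, pos)
    compressed = buf[pos] != 0          # 1 = some compression is enabled
    block_compressed = buf[pos + 1] != 0  # 1 = block compression, 0 = record
    pos += 2
    codec = None
    if compressed:
        codec, pos = _read_string(buf, pos)
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


# Reconstructed from the caret notation in the header dump (an assumption).
sample = (
    b"SEQ\x06"
    b"\x22org.apache.hadoop.io.BytesWritable"
    b"\x19org.apache.hadoop.io.Text"
    b"\x01\x00"
    b"\x27org.apache.hadoop.io.compress.GzipCodec"
)

header = parse_seqfile_header(sample)
for name, value in header.items():
    print(f"{name}: {value}")
```

If the caret notation really does map to those bytes, the compression flag decodes as set and the block-compression flag as unset, which would point to record-level rather than block-level compression. Running the same function on the first few hundred bytes of the real file would confirm or rule that out.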

Does Amazon Elastic MapReduce behave differently or am I doing something
wrong?

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com