Hive >> mail # user >> SequenceFile compression on Amazon EMR not very good


Re: SequenceFile compression on Amazon EMR not very good
Thanks, Zheng. Will do some more tests and get back.

Saurabh.

On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:

> I would first check whether it is really the block compression or
> record compression.
> Also, maybe the block size is too small, but I am not sure whether that
> is tunable in SequenceFile.
>
> Zheng
>
> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> > The size of my Gzipped weblog files is about 35MB. However, upon enabling
> > block compression, and inserting the logs into another Hive table
> > (sequencefile), the file size bloats up to about 233MB. I've done similar
> > processing on a local Hadoop/Hive cluster, and while the compression is
> > not as good as gzipping, it still is not this bad. What could be going
> > wrong?
> >
> > I looked at the header of the resulting file and here's what it says:
> >
> >
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
> >
> > Does Amazon Elastic MapReduce behave differently or am I doing something
> > wrong?
> >
> > Saurabh.
> > --
> > http://nandz.blogspot.com
> > http://foodieforlife.blogspot.com
> >
>
>
>
> --
> Yours,
> Zheng
>
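Zheng's record-vs-block distinction is the first thing to check here: when gzip is applied per record, each tiny weblog line is compressed in isolation and cannot exploit the heavy redundancy across lines, so the output can easily exceed the original gzipped size. A minimal Python sketch of that effect, using plain `gzip` rather than Hadoop itself; the sample log lines are made up for illustration:

```python
import gzip

# Simulated weblog lines: short and highly repetitive, like real access logs.
records = [
    f'192.168.0.{i % 255} - - [01/Feb/2010] "GET /page/{i} HTTP/1.1" 200 1234'.encode()
    for i in range(10_000)
]

# Record-style compression: each record gzipped on its own.
record_compressed = sum(len(gzip.compress(r)) for r in records)

# Block-style compression: many records gzipped together in one stream.
block_compressed = len(gzip.compress(b"\n".join(records)))

print(f"record-compressed total: {record_compressed} bytes")
print(f"block-compressed total:  {block_compressed} bytes")
```

The per-record total is many times larger, because every record pays the gzip header overhead and gets no cross-record dictionary reuse; SequenceFile RECORD compression suffers the same way on small records.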

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
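For what it's worth, the header dump quoted above can be decoded by hand. In the SequenceFile header, after the magic "SEQ", a version byte, and the two length-prefixed class names come two boolean bytes: `compressed` and `blockCompressed`. A Python sketch parsing the bytes shown in the thread (assuming single-byte VInt lengths, which holds for class names this short) — note the `^A^@` pair decodes to compressed=true, blockCompressed=false, i.e. the file appears to be record-compressed despite block compression being requested:

```python
def parse_seq_header(data: bytes):
    """Parse the start of a Hadoop SequenceFile header (version >= 6).

    Layout: 3-byte magic "SEQ", one version byte, two length-prefixed
    class names (key, value), then two boolean bytes: compressed and
    blockCompressed. Assumes the VInt length prefix fits in one byte,
    which is true for short class names like these.
    """
    assert data[:3] == b"SEQ", "not a SequenceFile"
    version = data[3]
    pos = 4
    names = []
    for _ in range(2):                      # key class, then value class
        length = data[pos]                  # single-byte VInt assumed
        pos += 1
        names.append(data[pos:pos + length].decode())
        pos += length
    compressed = bool(data[pos])
    block_compressed = bool(data[pos + 1])
    return version, names[0], names[1], compressed, block_compressed

# Header bytes reconstructed from the dump quoted in the thread
# (^F = 0x06, ^" = 0x22 = 34, ^Y = 0x19 = 25, ^A = 0x01, ^@ = 0x00):
header = (
    b"SEQ\x06"
    b"\x22org.apache.hadoop.io.BytesWritable"
    b"\x19org.apache.hadoop.io.Text"
    b"\x01\x00"   # compressed = true, blockCompressed = false
    b"'org.apache.hadoop.io.compress.GzipCodec"
)
print(parse_seq_header(header))
```

If this reading is right, it points at the setting Zheng suggested checking: the EMR job wrote RECORD-compressed output (e.g. `io.seqfile.compression.type` not set to BLOCK), which would explain the bloat.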