Hive, mail # user - SequenceFile compression on Amazon EMR not very good


Re: SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-19, 13:46
I'm confused here, Zheng. There are two sets of configuration variables:
those starting with io.* and those starting with mapred.*. To make sure
that the final output table is compressed, which ones do I have to set?

Saurabh.
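[Editor's note: for context, the two sets of properties interact roughly as follows: hive.exec.compress.output tells Hive to compress job output, the mapred.output.compression.* properties choose the codec and (for SequenceFile output) the compression type, and io.seqfile.compression.type is the Hadoop-level default that the mapred-level setting can override. A commonly used combination is sketched below; the property names are the Hadoop 0.20-era ones current when this thread was written, so verify them against your Hadoop/EMR version.]

```sql
-- Hedged sketch: compress Hive query output as block-compressed,
-- gzip-coded SequenceFiles (Hadoop 0.20-era property names).
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```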

On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <[EMAIL PROTECTED]> wrote:

> Did you also:
>
> SET mapred.output.compression.codec=org.apache....GZipCode;
>
> Zheng
>
> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <[EMAIL PROTECTED]>
> wrote:
> > Hi Zheng,
> >
> > I cross checked. I am setting the following in my Hive script before the
> > INSERT command:
> >
> > SET io.seqfile.compression.type=BLOCK;
> > SET hive.exec.compress.output=true;
> >
> > A 132 MB (gzipped) input file going through a cleanup and getting
> populated
> > in a sequencefile table is growing to 432 MB. What could be going wrong?
> >
> > Saurabh.
> >
> > On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <[EMAIL PROTECTED]>
> > wrote:
> >>
> >> Thanks, Zheng. Will do some more tests and get back.
> >>
> >> Saurabh.
> >>
> >> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:
> >>>
> >>> I would first check whether it is really the block compression or
> >>> record compression.
> >>> Also, maybe the block size is too small, but I am not sure whether
> >>> that is tunable in SequenceFile.
> >>>
> >>> Zheng
> >>>
> >>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[EMAIL PROTECTED]
> >
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > The size of my Gzipped weblog files is about 35MB. However, upon
> >>> > enabling
> >>> > block compression, and inserting the logs into another Hive table
> >>> > (sequencefile), the file size bloats up to about 233MB. I've done
> >>> > similar
> >>> > processing on a local Hadoop/Hive cluster, and while the compression
> >>> > is not as good as gzipping, it is still not this bad. What could be
> >>> > going wrong?
> >>> >
> >>> > I looked at the header of the resulting file and here's what it says:
> >>> >
> >>> >
> >>> >
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
> >>> >
> >>> > Does Amazon Elastic MapReduce behave differently or am I doing
> >>> > something
> >>> > wrong?
> >>> >
> >>> > Saurabh.
> >>> > --
> >>> > http://nandz.blogspot.com
> >>> > http://foodieforlife.blogspot.com
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Yours,
> >>> Zheng
> >>
> >>
> >>
> >> --
> >> http://nandz.blogspot.com
> >> http://foodieforlife.blogspot.com
> >
> >
> >
> > --
> > http://nandz.blogspot.com
> > http://foodieforlife.blogspot.com
> >
>
>
>
> --
> Yours,
> Zheng
>

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
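[Editor's note: Zheng's suggestion to check whether the file is really block-compressed or record-compressed can be answered directly from the header bytes quoted above. In the SequenceFile version-6 header, two boolean bytes follow the key and value class names: compressed, then block-compressed. In the quoted dump, `^A^@` is 0x01 0x00, i.e. compressed but NOT block-compressed, which would explain the poor ratio. A minimal parsing sketch, assuming the standard version-6 layout and single-byte vint lengths for the class names:]

```python
def parse_seqfile_header(data: bytes):
    """Parse the leading fields of a Hadoop SequenceFile (version 6) header.

    Returns (key_class, value_class, compressed, block_compressed, codec).
    Assumes class-name lengths fit in a single vint byte, which holds for
    the standard Hadoop class names.
    """
    assert data[:3] == b"SEQ", "not a SequenceFile"
    version = data[3]  # 6 in the dump above (^F)
    pos = 4

    def read_string(pos):
        # Hadoop's Text.writeString: vint length, then UTF-8 bytes.
        # Lengths under 128 occupy a single byte.
        length = data[pos]
        pos += 1
        return data[pos:pos + length].decode("utf-8"), pos + length

    key_class, pos = read_string(pos)
    value_class, pos = read_string(pos)
    compressed = bool(data[pos]); pos += 1
    block_compressed = bool(data[pos]); pos += 1
    codec = None
    if compressed:
        codec, pos = read_string(pos)
    return key_class, value_class, compressed, block_compressed, codec


# The header bytes quoted in the thread, with caret notation expanded:
# ^F = 0x06 (version), " = 0x22 (len 34), ^Y = 0x19 (len 25),
# ^A^@ = 0x01 0x00 (compressed, not block-compressed), ' = 0x27 (len 39).
header = (b"SEQ\x06"
          b"\x22org.apache.hadoop.io.BytesWritable"
          b"\x19org.apache.hadoop.io.Text"
          b"\x01\x00"
          b"\x27org.apache.hadoop.io.compress.GzipCodec")

print(parse_seqfile_header(header))
# block_compressed comes back False: the file is record-compressed
# despite io.seqfile.compression.type=BLOCK being set in the session.
```

If that is the case here, the fix would be to ensure the BLOCK setting actually reaches the job configuration (e.g. via mapred.output.compression.type), since per-record gzip on short records compresses very poorly.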