-
SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-01, 05:03
Hi,

The size of my gzipped weblog files is about 35 MB. However, upon enabling block compression and inserting the logs into another Hive table (sequencefile), the file size bloats to about 233 MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compression there is not as good as gzip, it is not this bad. What could be going wrong?

I looked at the header of the resulting file and here's what it says:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec

Does Amazon Elastic MapReduce behave differently, or am I doing something wrong?

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
-
Re: SequenceFile compression on Amazon EMR not very good
Zheng Shao 2010-02-01, 07:52
I would first check whether it is really using block compression or record compression. The block size may also be too small, but I am not sure whether that is tunable in SequenceFile.

Zheng
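One way to run the check Zheng suggests is to read the flags straight out of the file header. The sketch below is an illustration, not part of the original thread: it assumes the SequenceFile header layout used by versions 5 and 6 (magic `SEQ`, a version byte, length-prefixed key and value class names, a "compressed" flag byte, a "block compressed" flag byte, then the codec class name), and it only handles single-byte length prefixes. The helper names `read_exact`, `read_hadoop_string`, and `read_seqfile_header` are hypothetical, and the sample bytes are a transcription of the header pasted in the first message, assuming its control characters were copied faithfully.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or fail loudly on a truncated header."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of header")
    return data


def read_hadoop_string(stream):
    """Read a length-prefixed UTF-8 string (single-byte lengths only)."""
    (length,) = struct.unpack("b", read_exact(stream, 1))
    if length < 0:
        # Longer strings use a multi-byte prefix, which this sketch skips.
        raise ValueError("multi-byte length prefix not handled here")
    return read_exact(stream, length).decode("utf-8")


def read_seqfile_header(stream):
    """Parse the leading fields of a SequenceFile header (assumed v5/v6)."""
    if read_exact(stream, 3) != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = read_exact(stream, 1)[0]
    if version < 5:
        raise ValueError("this sketch assumes header version 5 or later")
    key_class = read_hadoop_string(stream)
    value_class = read_hadoop_string(stream)
    compressed = read_exact(stream, 1) != b"\x00"
    block_compressed = read_exact(stream, 1) != b"\x00"
    codec = read_hadoop_string(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


# Header as pasted in the thread: ^F = 0x06, ^Y = 0x19, ^A = 0x01, ^@ = 0x00.
# The '"' and "'" characters are themselves the length bytes (34 and 39).
sample = (
    b"SEQ\x06"
    b'"org.apache.hadoop.io.BytesWritable'
    b"\x19org.apache.hadoop.io.Text"
    b"\x01\x00"
    b"'org.apache.hadoop.io.compress.GzipCodec"
)
header = read_seqfile_header(io.BytesIO(sample))
print(header)
```

For a real file, the same function can be called on `open(path, "rb")` after copying the file (or just its first few hundred bytes) off HDFS or S3. Read this way, the pasted header has the "compressed" byte set (`^A`) and the "block compressed" byte clear (`^@`), which would point to record compression rather than block compression. If that reading holds, one setting that may be worth examining is `mapred.output.compression.type`, which Hadoop's defaults file lists with a default of `RECORD`; the thread itself does not confirm that this is the cause.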
-
Re: SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-03, 08:56
Thanks, Zheng. Will do some more tests and get back.

Saurabh.
-
Re: SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-18, 16:25
Hi Zheng,

I cross-checked. I am setting the following in my Hive script before the INSERT command:

    SET io.seqfile.compression.type=BLOCK;
    SET hive.exec.compress.output=true;

A 132 MB (gzipped) input file that goes through a cleanup step and is then loaded into a sequencefile table grows to 432 MB. What could be going wrong?

Saurabh.
-
Re: SequenceFile compression on Amazon EMR not very good
Zheng Shao 2010-02-18, 19:07
Did you also set the output codec?

    SET mapred.output.compression.codec=org.apache....GZipCode;

Zheng
-
Re: SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-19, 13:46
I'm confused here, Zheng. There are two sets of configuration variables: those starting with io.* and those starting with mapred.*. To make sure that the final output table is compressed, which ones do I have to set?

Saurabh.
-
Re: SequenceFile compression on Amazon EMR not very good
Saurabh Nanda 2010-02-19, 13:53
And also hive.exec.compress.*. So that makes three sets of configuration variables:

    mapred.output.compress.*
    io.seqfile.compress.*
    hive.exec.compress.*

What's the relationship between these configuration parameters, and which ones should I set to get a well-compressed output table?

Saurabh.
-
Re: SequenceFile compression on Amazon EMR not very good
Zheng Shao 2010-02-19, 18:09
hive.exec.compress.output controls whether or not to compress Hive output. (Within Hive, this overrides mapred.output.compress.) All other compression flags come from Hadoop. Please see http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html

Zheng