-
Comparison between Gzip and LZO
José Vinícius Pimenta Col... 2011-03-02, 19:32
Hi,

I'm making a comparison between the following compression methods: gzip and LZO as provided by Hadoop, and gzip from the package java.util.zip.
The test consists of compressing and decompressing approximately 92,000 files with an average size of 2 KB. However, the decompression time of LZO is twice the decompression time of the gzip provided by Hadoop, which does not seem right.
The results obtained in the test are:

Method | Bytes | Compression: total time (with I/O) | Compression: time | Compression: speed | Decompression: total time (with I/O) | Decompression: time | Decompression: speed
Gzip (Hadoop) | 200876304 | 121.454s | 43.167s | 4,653,424.079 B/s | 332.305s | 111.806s | 1,796,635.326 B/s
Lzo | 200876304 | 120.564s | 54.072s | 3,714,914.621 B/s | 509.371s | 184.906s | 1,086,368.904 B/s
Gzip (java.util.zip) | 200876304 | 148.014s | 63.414s | 3,167,647.371 B/s | 483.148s | 4.528s | 44,360,682.244 B/s

You can see the code I'm using for the test here:
http://www.linux.ime.usp.br/~jvcoletto/compression/

Can anyone explain why I am getting these results?
Thanks.
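For readers checking the table: the "Speed" column appears to be simply Bytes divided by the codec "Time" column (the figure that excludes I/O). The sketch below recomputes it under that assumption. The class and method names are hypothetical, and expressing the result in MB/s with 1 MB = 10^6 bytes is a choice made here, not something stated in the post.

```java
// Minimal sketch: recompute the Speed column as bytes / seconds.
// Assumption: "Time" in the table is codec time excluding I/O.
class Throughput {
    /** Bytes per second for a payload processed in the given number of seconds. */
    static double bytesPerSecond(long bytes, double seconds) {
        return bytes / seconds;
    }

    public static void main(String[] args) {
        long bytes = 200_876_304L; // "Bytes" column, same for every method
        String[] names = {"Gzip (Hadoop)", "Lzo", "Gzip (java.util.zip)"};
        double[] compressSeconds = {43.167, 54.072, 63.414};
        double[] decompressSeconds = {111.806, 184.906, 4.528};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-22s compress %.2f MB/s, decompress %.2f MB/s%n",
                    names[i],
                    bytesPerSecond(bytes, compressSeconds[i]) / 1e6,
                    bytesPerSecond(bytes, decompressSeconds[i]) / 1e6);
        }
    }
}
```

Seen in MB/s, both Hadoop codecs decompress at under 2 MB/s, which is the figure the replies in this thread question.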
-
Re: Comparison between Gzip and LZO
Niels Basjes 2011-03-02, 20:16
Question: Are you 100% sure that nothing else was running on that system during the tests?
No cron jobs, no "makewhatis" or "updatedb"?

P.S. There is a permission issue with downloading one of the files.

--
Met vriendelijke groeten,

Niels Basjes
-
Re: Comparison between Gzip and LZO
Brian Bockelman 2011-03-03, 03:12
I think some profiling is in order: claiming LZO decompresses at 1.0MB/s and is more than 3x faster at compression than decompression (especially when it's a well known asymmetric algorithm in favor of decompression speed) is somewhat unbelievable.

I see that you use small files. Maybe whatever you do for LZO and Gzip/Hadoop has a large startup overhead?

Again, sounds like you'll be spending an hour or so with a profiler.

Brian
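One way to probe the startup-overhead hypothesis is to time the same small chunks twice: once creating a fresh compressor per chunk, once reusing a single compressor. The sketch below does this with java.util.zip.Deflater only. It does not exercise Hadoop's codec classes or the original test code, and the class name, chunk count, and synthetic data are assumptions made for illustration.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Minimal sketch: per-chunk compressor setup vs. one reused compressor.
// Assumption: synthetic 2 KB chunks stand in for the small files in the test.
class SetupOverhead {
    /** Compresses each chunk with a brand-new Deflater; returns total compressed bytes. */
    static long freshPerChunk(byte[][] chunks) {
        byte[] out = new byte[8192];
        long total = 0;
        for (byte[] chunk : chunks) {
            Deflater d = new Deflater();
            total += deflateAll(d, chunk, out);
            d.end(); // release the native zlib state
        }
        return total;
    }

    /** Compresses each chunk with one shared Deflater that is reset between chunks. */
    static long reusedAcrossChunks(byte[][] chunks) {
        byte[] out = new byte[8192];
        long total = 0;
        Deflater d = new Deflater();
        for (byte[] chunk : chunks) {
            d.reset();
            total += deflateAll(d, chunk, out);
        }
        d.end();
        return total;
    }

    private static long deflateAll(Deflater d, byte[] in, byte[] out) {
        d.setInput(in);
        d.finish();
        long n = 0;
        while (!d.finished()) {
            n += d.deflate(out); // output is discarded; only its size is counted
        }
        return n;
    }

    /** Hypothetical test data: chunks drawn from a small alphabet so they compress. */
    static byte[][] makeChunks(int count, int size) {
        Random rnd = new Random(42);
        byte[][] chunks = new byte[count][size];
        for (byte[] c : chunks) {
            for (int i = 0; i < size; i++) {
                c[i] = (byte) ('a' + rnd.nextInt(8));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[][] chunks = makeChunks(10_000, 2048);
        for (int round = 0; round < 3; round++) { // repeat so the JIT can warm up
            long t0 = System.nanoTime();
            long fresh = freshPerChunk(chunks);
            long t1 = System.nanoTime();
            long reused = reusedAcrossChunks(chunks);
            long t2 = System.nanoTime();
            System.out.printf("fresh: %d ms (%d B), reused: %d ms (%d B)%n",
                    (t1 - t0) / 1_000_000, fresh, (t2 - t1) / 1_000_000, reused);
        }
    }
}
```

Both variants produce the same compressed size, so any timing gap between them is setup and teardown cost rather than compression work. A profiler run on the real test would still be needed to confirm where the Hadoop codecs spend their time.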
-
Re: Comparison between Gzip and LZO
James Seigel 2011-03-03, 03:15
Slightly off-topic for this conversation, but I thought it worth mentioning: LZO is splittable, which makes it a good fit for Hadoop-y things. Just something to remember when you do get some final results on performance.

Cheers
James.
-
Re: Comparison between Gzip and LZO
Jose Vinicius Pimenta Col... 2011-03-03, 09:15
During the tests, the only other programs running are Eclipse and Chromium. I don't believe they affect the results, because they are running during the entire test.
The permissions issue has been fixed.
Thanks.
-- Jose Vinicius Pimenta Coletto
-
Re: Comparison between Gzip and LZO
Jose Vinicius Pimenta Col... 2011-03-29, 20:45
During this month I refactored the code used for the tests and kept running them on the same data set mentioned above (about 92,000 files with an average size of 2 KB), but with a different procedure: I ran the compression and decompression 50 times on eight different computers.

The results were no different from those previously reported: on average, gzip was twice as fast as LZO.

As a last resort I profiled the code with JProfiler, but found nothing to explain why gzip is faster than LZO.

At http://www.linux.ime.usp.br/~jvcoletto/compression/ I share the table with the results obtained in the tests, the code used in the tests, and the results obtained in JProfiler.

Does anyone have any ideas to help me?
Thank you.

--
Jose Vinicius Pimenta Coletto
-
Re: Comparison between Gzip and LZO
Greg Roelofs 2011-03-29, 22:54
> During this month I refactor the code used for the tests and kept doing
> them with the same base mentioned above (about 92 000 files with an average
> size of 2kb),

Those are _tiny_. It seems likely to me that you're spending most of your time on I/O related to metadata (disk seeks, directory traversal, file open/close, codec setup/teardown, buffer-cache churn) and very little on "real" compression or even "real" file I/O. Is any of this happening on HDFS? If so, add network I/O and namenode overhead, too.

For Hadoop, your file sizes should start at megabytes or tens of megabytes, and it will really hit its stride above that.

Also, are you compressing text or binaries?

> In this address http://www.linux.ime.usp.br/~jvcoletto/compression/ I share the
> table with the results obtained in the tests, the code used in the tests and
> the results obtained in JProfiler.

In my own tests with (C) command-line tools on Linux (and I've now forgotten whether the system used fast SCSI disks or regular SATA), lzop's decompression speed averaged 18-21 compressed MB/sec for binaries and 5-8 cMB/sec for text. gzip on the same corpus averaged 9-10 cMB/sec for binaries and 3.5-4.5 cMB/sec for text. (Text compresses better, so the same input size means more output size => slower due to I/O.)

For compression, gzip ranged from 2.5-10 uncompressed MB/sec, depending on data type and compression level. lzop is basically two compressors; for levels 1-6, it averaged 15-16.5 ucMB/sec regardless of input or level, while levels 7-9 dropped from 3 to 1 ucMB/sec. (IOW, don't use LZO levels above 6.)

Java interfaces will add some overhead, but since all of the codecs in question are ultimately native C code, this should give you some idea of which numbers are most suspect. But don't bother benchmarking anything much below a megabyte; it's a waste of time.

Greg
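To illustrate the point about tiny inputs, the sketch below compares gzip-compressing many roughly 2 KB records one stream each against packing them, length-prefixed, into a single gzip stream. It uses only java.util.zip and in-memory buffers, so it shows per-stream overhead in output size rather than the disk and metadata costs described above. The class name, record contents, and length-prefix framing are assumptions for illustration, not anything from the original test code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Minimal sketch: one gzip stream per small record vs. one stream for all records.
class PackSmallFiles {
    /** Sum of compressed sizes when every record gets its own gzip stream. */
    static long separately(byte[][] records) {
        long total = 0;
        for (byte[] r : records) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
                gz.write(r);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            total += sink.size(); // read after close so the gzip trailer is included
        }
        return total;
    }

    /** Compressed size when all records share one gzip stream, each length-prefixed. */
    static long packed(byte[][] records) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(sink))) {
            for (byte[] r : records) {
                out.writeInt(r.length); // hypothetical framing so records can be split again
                out.write(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    /** Hypothetical text-like records of roughly 2 KB each. */
    static byte[][] sampleRecords(int count) {
        byte[][] records = new byte[count][];
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            for (int line = 0; sb.length() < 2048; line++) {
                sb.append("record=").append(i).append(" line=").append(line)
                  .append(" status=OK payload=lorem ipsum dolor sit amet\n");
            }
            records[i] = sb.toString().getBytes(StandardCharsets.UTF_8);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[][] records = sampleRecords(1_000);
        System.out.printf("separate streams: %d B, one packed stream: %d B%n",
                separately(records), packed(records));
    }
}
```

The packed stream pays the gzip header and trailer once and lets later records reuse the compressor's history from earlier ones. On real files, a benchmark would also need to count the open/close and seek costs that this in-memory version leaves out.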