-
LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Vincent Barat, 2010-02-23, 18:39

Hello,

I did some testing to figure out which compression algorithm I should use for my HBase tables. I thought LZO was the best candidate, but it appears to be the worst one.

I use one table with 2 families and 10 columns. Each row has a total of 200 to 400 bytes.

Here are my results:

Compression      Inserts/s       Reads/s
GZIP             2600 to 3200    12000 to 15000
NO COMPRESSION   2000 to 2600    4900 to 5020
LZO              1600 to 2100    4020 to 4600

Do you have an explanation for this? I thought that LZO was always faster than GZIP at both compression and decompression.
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Jean-Daniel Cryans, 2010-02-23, 21:15

Vincent,

I wouldn't expect that either. Can you give us more info about your test environment?

Thx,

J-D
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Vincent Barat, 2010-02-24, 09:52

Yes of course.

We use a 4-machine cluster (4 large instances on AWS): 8 GB RAM each, dual-core CPU. One machine hosts the Hadoop namenode and the HBase master, and the other 3 host the datanodes / regionservers.

The test table is created first, then I insert a set of rows sequentially and count the number of rows inserted per second.

I insert rows in batches of 1000, using HTable.put(List<Put>).

When reading, I also read sequentially, using a scanner (scanner caching is set to 1024 rows).

Maybe our installation of LZO is not good?
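
The measurement described above (insert in batches of 1000, count rows per second) can be sketched as follows. This is not Vincent's code, which was never posted: the class name, the `Consumer` sink and the generated row strings are hypothetical stand-ins for an HBase client call such as HTable.put(List<Put>), so only the batching and the rate arithmetic are illustrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of a batch-insert throughput measurement; the sink is a hypothetical stand-in for an HBase put. */
final class InsertThroughput {

    /** Converts a row count and an elapsed time in nanoseconds into rows per second. */
    static double rowsPerSecond(long rows, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return rows * 1_000_000_000.0 / elapsedNanos;
    }

    /** Sends totalRows made-up rows to the sink in batches of at most batchSize; returns the rows sent. */
    static long insertInBatches(int totalRows, int batchSize, Consumer<List<String>> sink) {
        long sent = 0;
        List<String> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < totalRows; i++) {
            batch.add("row-" + i);
            if (batch.size() == batchSize) {
                sink.accept(batch);
                sent += batch.size();
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, possibly shorter, batch.
            sink.accept(batch);
            sent += batch.size();
        }
        return sent;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // A no-op sink: replace it with a real client call to measure anything meaningful.
        long sent = insertInBatches(100_000, 1_000, batch -> { });
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.printf("%d rows, %.0f rows/s%n", sent, rowsPerSecond(sent, elapsed));
    }
}
```

Timing the whole run rather than each batch keeps timer overhead out of the result, but it also averages away any pauses (flushes, splits) that happen on the server side during the run.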
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Jean-Daniel Cryans, 2010-02-24, 18:10

Are you able to post the code used for the insertion? It could be something with your usage pattern or something wrong with the code itself.

How many rows are you inserting? Do you even have some region splits?

J-D
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Vincent Barat, 2010-02-25, 10:49

Unfortunately I can post only some snippets.

I have no region split (I insert just 100000 rows so there is no split, except when I don't use compression).

I use HBase 0.20.2, and to insert I use HTable.put(List<Put>).

The only difference between my 3 tests is the way I create the test table:

    // Imports assume the HBase 0.20.x package layout.
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression.Algorithm;
    import org.apache.hadoop.hbase.util.Bytes;

    // config (HBase configuration) and name (table name) are defined elsewhere.
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor desc = new HTableDescriptor(name);
    HColumnDescriptor colDesc;

    // "meta" family, one version per cell
    colDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
    colDesc.setMaxVersions(1);
    colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
    desc.addFamily(colDesc);

    // "data" family, same settings
    colDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
    colDesc.setMaxVersions(1);
    colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
    desc.addFamily(colDesc);

    admin.createTable(desc);

A typical inserted row is made of 13 columns with short content, as shown here (row key 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730):

    column=data:accuracy, timestamp=1267006115356, value=1317
    column=data:alt, timestamp=1267006115356, value=0
    column=data:country, timestamp=1267006115356, value=France
    column=data:countrycode, timestamp=1267006115356, value=FR
    column=data:lat, timestamp=1267006115356, value=48.65869706
    column=data:locality, timestamp=1267006115356, value=Morsang-sur-Orge
    column=data:lon, timestamp=1267006115356, value=2.36138182
    column=data:postalcode, timestamp=1267006115356, value=91390
    column=data:region, timestamp=1267006115356, value=Ile-de-France
    column=meta:imei, timestamp=1267006115356, value=6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
    column=meta:infoid, timestamp=1267006115356, value=ca30781e0c375a1236afbf323cbfa40dc2c7c7af
    column=meta:locid, timestamp=1267006115356, value=5e15a0281e83cfe55ec1c362f84a39f006f18128
    column=meta:timestamp, timestamp=1267006115356, value=1264761195240

Maybe LZO works much better with fewer rows with bigger content?
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Jean-Daniel Cryans, 2010-02-25, 18:29

If there is only 1 region, providing more than one node will probably just slow down the test, since the load is handled by one machine which has to replicate blocks 2 times. I think your test would have much more value if you really grew to at least 10 regions. Also make sure to run the tests more than once on completely new HBase setups (drop table + restart should be enough).

May I also recommend upgrading to HBase 0.20.3? It will provide a better experience in general.

J-D
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Vincent Barat, 2010-02-28, 17:30

The impact of my cluster architecture on performance is obviously the same in my 3 test cases. Given that I only change the compression type between tests, I don't understand why changing the number of regions or anything else would change the speed ratio between my tests, especially between the GZIP and LZO tests.

Is there a ready-to-use, easy-to-set-up benchmark I could use to try to reproduce the issue in a well-known environment?
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Dan Washusen, 2010-02-28, 21:46

A couple of questions:

- What's your block cache hit ratio when running each of those tests?
- How large are the store files in each of the tests?
- What's the compression ratio between None, LZO and GZIP?

Does the test match your expected usage scenario? Do you intend to serve all data from the block cache, or will there be a lot more data in real life?

With such a small dataset you are probably not seeing the full benefits of the bullet points mentioned on the HBase + LZO page, because all the data resides in memory on the region server...

Here are the points from http://wiki.apache.org/hadoop/UsingLzoCompression:

> - Compression reduces the number of bytes written to/read from HDFS
> - Compression effectively improves the efficiency of network bandwidth and disk space
> - Compression reduces the size of data needed to be read when issuing a read

It's puzzling that GZIP is faster than no compression in your tests...
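
The compression-ratio question can be approximated offline. The sketch below uses the JDK's java.util.zip GZIP on made-up rows loosely modelled on the sample row posted earlier in the thread; the class name, helper names and row layout are assumptions for illustration only. LZO is not part of the JDK, so only GZIP is measured, and HBase compresses store-file blocks rather than one large buffer, so real ratios will differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPOutputStream;

/** Rough offline estimate of how well row-like data compresses with GZIP. */
final class GzipRatio {

    /** Compresses the input with the JDK's GZIP implementation. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the GZIP trailer has been written.
        return out.toByteArray();
    }

    /** Uncompressed size divided by compressed size; values above 1 mean the data shrank. */
    static double ratio(byte[] input) {
        return (double) input.length / gzip(input).length;
    }

    /** Builds made-up rows: a long key repeated for each short cell, as in the thread's sample. */
    static byte[] sampleRows(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            String key = (1264761195240L + i) + "/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730";
            sb.append(key).append(" data:country=France\n");
            sb.append(key).append(" data:countrycode=FR\n");
            sb.append(key).append(" data:region=Ile-de-France\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleRows(1000);
        System.out.printf(Locale.ROOT, "raw=%d bytes, ratio=%.1f%n", data.length, ratio(data));
    }
}
```

Because the row key is repeated in every cell, data shaped like this tends to compress very well, which is one reason the on-disk size can differ a lot between the three test tables.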
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Jean-Daniel Cryans, 2010-02-28, 23:56

As Dan said, your data is so small that you don't really trigger many different behaviors in HBase; it could very well be kept mostly in the memstores, where compression has no impact at all.

WRT a benchmark, there's PerformanceEvaluation (we call it PE for short), which is well maintained and lets you set a compression level. This page has outdated help text but it shows you how to run it: http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation

Another option is importing the Wikipedia dump, which is highly compressible and not manufactured like the PE data. Last summer I wrote a small MR job to do the import easily, and although the code is based on a dev version of 0.20.0, it should be fairly easy to make it work on 0.20.3 (probably just replacing the libs). See http://code.google.com/p/hbase-wikipedia-loader/

See the last paragraph of the Getting Started page in the wiki, where I show some import numbers:

"For example, it took 29 min on a 6 nodes cluster (1 master and 5 region servers) with the same hardware (AMD Phenom(tm) 9550 Quad, 8GB, 2x1TB disks), 2 map slot per task tracker (that's 10 parallel maps), and GZ compression. With LZO and a new table it took 23 min 20 sec. Compressed the table is 32 regions big, uncompressed it's 93 and took 30 min 10 sec to import."

You can see that the import was a lot faster with LZO. I didn't do any read tests though...

Good luck!

J-D
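
To put the quoted import times side by side, here is a small worked calculation. It only restates the three durations from the quote (29 min with GZ, 23 min 20 sec with LZO, 30 min 10 sec uncompressed) as ratios; the class and method names are made up for the example.

```java
import java.util.Locale;

/** Converts the import durations quoted above to seconds and compares them. */
final class ImportTimes {

    /** Minutes and seconds to total seconds. */
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    /** How many times faster the candidate is than the baseline. */
    static double speedup(int baselineSeconds, int candidateSeconds) {
        return (double) baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        int gz = seconds(29, 0);     // 1740 s
        int lzo = seconds(23, 20);   // 1400 s
        int none = seconds(30, 10);  // 1810 s
        System.out.printf(Locale.ROOT, "LZO vs GZ: %.2fx%n", speedup(gz, lzo));     // prints "LZO vs GZ: 1.24x"
        System.out.printf(Locale.ROOT, "LZO vs none: %.2fx%n", speedup(none, lzo)); // prints "LZO vs none: 1.29x"
    }
}
```

So in that import, LZO was roughly a quarter faster than either GZ or no compression, while GZ and no compression were within a few percent of each other.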
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Dan Washusen, 2010-03-01, 00:20

My (very rough) calculation of the data size came up with around 50MB. That was assuming 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for the keys and an extra meg or two for extra key stuff. I didn't understand how that resulted in a region split, so I assume we are still missing some information (or I made a mistake). As you mention, that should mean that everything is in the MemStore and compression has not come into play yet. Puzzling...

On PE: there isn't currently a way to specify compression options on the testtable without extending PE and overriding the org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method. Maybe it could be added as an option?

Cheers,
Dan
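
The rough size estimate can be reproduced as a worked calculation. The key term "32 + 8 * 13 * 100,000" is ambiguous as written, so this sketch evaluates both readings; which one was intended is an assumption, though only the literal-precedence reading lands near the stated 50MB. The class and method names are made up for the example.

```java
import java.util.Locale;

/** Reproduces the back-of-the-envelope data size estimate under two readings of the key term. */
final class DataSizeEstimate {

    /** Total bytes of values: rows times value bytes per row. */
    static long valueBytes(long rows, long bytesPerRow) {
        return rows * bytesPerRow;
    }

    /** Key bytes read with normal operator precedence: 32 + (8 * columns * rows). */
    static long keyBytesLiteral(long rows, long columns) {
        return 32 + 8 * columns * rows;
    }

    /** Key bytes read as (32 + 8) per cell: (32 + 8) * columns * rows. */
    static long keyBytesGrouped(long rows, long columns) {
        return (32 + 8) * columns * rows;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        long values = valueBytes(rows, 400);                  // 40,000,000 bytes
        long literal = values + keyBytesLiteral(rows, 13);    // 50,400,032 bytes
        long grouped = values + keyBytesGrouped(rows, 13);    // 92,000,000 bytes
        System.out.printf(Locale.ROOT, "literal reading: %.1f MB%n", literal / 1e6); // prints "literal reading: 50.4 MB"
        System.out.printf(Locale.ROOT, "grouped reading: %.1f MB%n", grouped / 1e6); // prints "grouped reading: 92.0 MB"
    }
}
```

Both figures exclude the "extra meg or two" mentioned in the message, and neither accounts for how HBase actually lays out keys on disk or in memory.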
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Jean-Daniel Cryans, 2010-03-01, 00:24

Oh sorry, I was looking at the trunk code (as usual), which has compression and many other features. It's not in the 0.20 branch.

J-D
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ??? Vincent Barat 2010-03-01, 15:18
On 01/03/10 01:20, Dan Washusen wrote:
> My (very rough) calculation of the data size came up with around 50MB. That
> was assuming 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for
> the keys and an extra meg or two for extra key stuff. I didn't understand
> how that resulted in a region split, so I assume we are still missing
> some information (or I made a mistake). As you mention, that should mean
> that everything is in the MemStore and compression has not come into play
> yet. Puzzling...

You are right, there is no region split when I use no compression.
Nevertheless, as you say, if everything is in the memstore, how can it be that I see such a big difference between my tests?

> On PE; there isn't currently a way to specify compression options on the
> testtable without extending PE and overriding
> org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method.
> Maybe it could be added as an option?
>
> Cheers,
> Dan
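
Dan's rough size estimate can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: it reads his "32 + 8 * 13" as a per-row key cost (read that way, the total lands near his "around 50MB"), it takes the 400-byte upper bound for the row payload, and it compares the result with an assumed memstore flush threshold of 64 MB. That threshold is chosen for illustration and is not stated anywhere in this thread; the real value is whatever `hbase.hregion.memstore.flush.size` is set to on the cluster. Per-cell in-memory overhead is ignored, so the actual memstore footprint would be larger.

```python
# Rough data-size estimate for the test described in this thread.
ROWS = 100_000                    # number of rows inserted, per the thread
VALUE_BYTES_PER_ROW = 400         # upper bound on row payload, per the thread
KEY_BYTES_PER_ROW = 32 + 8 * 13   # assumption: Dan's key term, taken per row

# Assumption for illustration only: a 64 MB memstore flush threshold.
ASSUMED_FLUSH_BYTES = 64 * 1024 * 1024


def estimate_bytes(rows: int, value_bytes: int, key_bytes: int) -> int:
    """Raw bytes written, ignoring any per-cell in-memory overhead."""
    return rows * (value_bytes + key_bytes)


total = estimate_bytes(ROWS, VALUE_BYTES_PER_ROW, KEY_BYTES_PER_ROW)
fits_in_memstore = total < ASSUMED_FLUSH_BYTES

print(f"estimated raw size: {total / 1e6:.1f} MB")
print(f"below assumed flush threshold: {fits_in_memstore}")
```

Under these assumptions the raw data stays below the flush threshold, which is consistent with the idea that most of the test never reaches the compressed store files. Once per-cell overhead is counted, a flush could still happen in some runs and not in others.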
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ??? Jean-Daniel Cryans 2010-03-01, 19:16
> You are right, there is no region split when I use no compression.
> Nevertheless, as you say, if everything is in the memstore, how can it be
> that I see such a big difference between my tests?

Well, did you run your test more than once? Do you see the exact same results every time? IMO at that scale the differences could be stuff like the memstore getting flushed during one test and not the other. I really, really recommend testing with more data.

J-D
-
Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ??? Vincent Barat 2010-03-02, 10:42
Yes, I ran my tests several times and the difference was stable.
On 01/03/10 20:16, Jean-Daniel Cryans wrote:
>> You are right, there is no region split when I use no compression.
>> Nevertheless, as you say, if everything is in the memstore, how can it be
>> that I see such a big difference between my tests?
>
> Well, did you run your test more than once? Do you see the exact same
> results every time? IMO at that scale the differences could be stuff
> like the memstore getting flushed during one test and not the other. I
> really, really recommend testing with more data.
>
> J-D
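
One crude way to judge whether such differences exceed run-to-run variation is to check whether the reported throughput ranges overlap. The sketch below uses the ranges from the first message of this thread; the function and variable names are made up for illustration, and an interval-overlap test is only a rough stand-in for a proper statistical comparison over many runs on larger data.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (lowest, highest) observed operations per second

# Ranges as reported in the first message of the thread.
inserts: Dict[str, Range] = {
    "GZIP": (2600, 3200),
    "NONE": (2000, 2600),
    "LZO": (1600, 2100),
}
reads: Dict[str, Range] = {
    "GZIP": (12000, 15000),
    "NONE": (4900, 5020),
    "LZO": (4020, 4600),
}


def overlaps(a: Range, b: Range) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def separated_pairs(results: Dict[str, Range]) -> List[Tuple[str, str]]:
    """Pairs of configurations whose observed ranges do not overlap."""
    names = sorted(results)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if not overlaps(results[x], results[y])
    ]


print("inserts:", separated_pairs(inserts))
print("reads:  ", separated_pairs(reads))
```

With these numbers the read ranges are all disjoint, while for inserts only GZIP and LZO are clearly separated; the uncompressed range touches GZIP at one end and overlaps LZO at the other.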