HBase, mail # user - LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???

Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Vincent Barat 2010-03-01, 15:18

On 01/03/10 01:20, Dan Washusen wrote:
> My (very rough) calculation of the data size came up with around 50MB.  That
> was assuming 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for
> the keys and an extra meg or two for extra key stuff.  I didn't understand
> how that resulted in a region split, so I assume we are still missing
> some information (or I made a mistake).  As you mention, that should mean
> that everything is in the MemStore and compression has not come into play
> yet.  Puzzling...
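
The rough estimate quoted above can be reproduced as a small calculation. This is only a sketch: the class and method names are hypothetical, and it assumes the key expression "32 + 8 * 13 * 100,000" is meant to be read with ordinary operator precedence, which is the reading that lands near the quoted 50MB (before the "extra meg or two").

```java
public class DataSizeEstimate {
    // Number of rows inserted in the test, as stated in the thread.
    static final long ROWS = 100_000L;

    // Values: 400 bytes per row.
    static long valueBytes() {
        return 400L * ROWS;
    }

    // Keys: "32 + 8 * 13 * 100,000", evaluated with normal precedence
    // (an assumption about what the original author intended).
    static long keyBytes() {
        return 32L + 8L * 13L * ROWS;
    }

    // Total, excluding the "extra meg or two" of key overhead.
    static long totalBytes() {
        return valueBytes() + keyBytes();
    }

    public static void main(String[] args) {
        System.out.printf("values: %d bytes%n", valueBytes());
        System.out.printf("keys:   %d bytes%n", keyBytes());
        System.out.printf("total:  %.1f MB%n", totalBytes() / 1_000_000.0);
    }
}
```

Under these assumptions the values account for about 40 MB and the keys for about 10 MB.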

You are right, there is no region split when I use no compression.
Nevertheless, as you say, if everything is in the memstore, how can
it be that I see such a big difference between my tests?

> On PE: there isn't currently a way to specify compression options on the
> testtable without extending PE and overriding the
> org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method.
> Maybe it could be added as an option?
> Cheers,
> Dan
> On 1 March 2010 10:56, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>> As Dan said, your data is so small you don't really trigger many
>> different behaviors in HBase; it could very well be kept mostly in the
>> memstores, where compression has no impact at all.
>> WRT a benchmark, there's the PerformanceEvaluation (we call it PE for
>> short) which is well maintained and lets you set a compression level.
>> The help text on this page is outdated, but it shows you how to run it:
>> http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
>> Another option is importing the wikipedia dump, which is highly
>> compressible and not manufactured like the PE. Last summer I wrote a
>> small MR job to do the import easily and although the code is based on
>> a dev version 0.20.0, it should be fairly easy to make it work on
>> 0.20.3 (probably just replacing the libs). See
>> http://code.google.com/p/hbase-wikipedia-loader/
>> See the last paragraph of Getting Started in the wiki, where I show some
>> import numbers:
>> "For example, it took 29 min on a 6-node cluster (1 master and 5
>> region servers) with the same hardware (AMD Phenom(tm) 9550 Quad, 8GB,
>> 2x1TB disks), 2 map slots per task tracker (that's 10 parallel maps),
>> and GZ compression. With LZO and a new table it took 23 min 20 sec.
>> Compressed, the table is 32 regions big; uncompressed it's 93 and took
>> 30 min 10 sec to import."
>> You can see that the import was a lot faster on LZO. I didn't do any
>> reading test tho...
>> Good luck!
>> J-D
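
To put the quoted import numbers side by side, here is a minimal sketch that converts the three timings to seconds and computes ratios. The class and method names are hypothetical, and the figures are only the ones quoted above (29 min with GZ, 23 min 20 sec with LZO, 30 min 10 sec uncompressed, 32 vs 93 regions). The quote does not say which codec the 32-region figure refers to, so the region ratio is simply "compressed vs uncompressed".

```java
public class ImportTimings {
    // Convert a "min sec" duration to seconds.
    static int seconds(int min, int sec) {
        return min * 60 + sec;
    }

    // Import durations quoted from the wiki page.
    static final int GZ = seconds(29, 0);
    static final int LZO = seconds(23, 20);
    static final int NONE = seconds(30, 10);

    // Region counts quoted from the wiki page.
    static final int REGIONS_COMPRESSED = 32;
    static final int REGIONS_UNCOMPRESSED = 93;

    // How many times faster the candidate run is than the baseline run.
    static double speedup(int baselineSeconds, int candidateSeconds) {
        return (double) baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("LZO vs GZ:   %.2fx%n", speedup(GZ, LZO));
        System.out.printf("LZO vs none: %.2fx%n", speedup(NONE, LZO));
        System.out.printf("regions, uncompressed/compressed: %.2f%n",
                (double) REGIONS_UNCOMPRESSED / REGIONS_COMPRESSED);
    }
}
```

On these figures the LZO import took roughly 80% of the GZ import time, and the uncompressed table needed close to three times as many regions. These are single runs on one cluster, so they are an indication and not a benchmark.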
>> On Sun, Feb 28, 2010 at 9:30 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>> The impact of my cluster architecture on performance is obviously the
>>> same in my 3 test cases. Given that I only change the compression type
>>> between tests, I don't understand why changing the number of regions or
>>> whatever else would change the speed ratio between my tests, especially
>>> between the GZIP & LZO tests.
>>> Is there some ready-to-use and easy-to-set-up benchmark I could use to try
>>> to reproduce the issue in a well-known environment?
>>> On 25/02/10 19:29, Jean-Daniel Cryans wrote:
>>>> If there is only 1 region, providing more than one node will probably
>>>> just slow down the test, since the load is handled by one machine which
>>>> has to replicate blocks 2 times. I think your test would have much more
>>>> value if you really grew at least to 10 regions. Also make sure to run
>>>> the tests more than once on completely new hbase setups (drop table +
>>>> restart should be enough).
>>>> May I also recommend upgrading to hbase 0.20.3? It will provide a
>>>> better experience in general.
>>>> J-D
>>>> On Thu, Feb 25, 2010 at 2:49 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>>>> Unfortunately I can post only some snapshots.
>>>>> I have no region split (I insert just 100000 rows so there is no split,
>>>>> except when I don't use compression).
>>>>> I use HBase 0.20.2 and, to insert, I use HTable.put(List<Put>);