

Re: Documenting Guidance on compression and codecs
PE (PerformanceEvaluation) has short, unique keys, so prefix encoding won't buy much there (and may even make things worse).

What's interesting to me is the difference between SNAPPY and LZO; I expected them to be mostly equivalent in terms of compression.

So as a general guideline I'd say:
o If you have long keys (compared to the values) or many columns, use a prefix encoder. Of the encoders, use only FAST_DIFF.
o If the values are large (and not already compressed, as images are), use a block compressor (SNAPPY, LZO, GZIP, etc.).
o Use GZIP for cold data.
o Use SNAPPY or LZO for hot data.
o In most cases you do want to enable SNAPPY or LZO by default (low performance overhead plus space savings).
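
As a rough illustration only, the guidelines above could be written down as a small Python helper. The function name `recommend` and its two boolean parameters are hypothetical and not part of HBase; the returned keys mirror the column family attributes DATA_BLOCK_ENCODING and COMPRESSION. The sketch assumes the "enable SNAPPY or LZO by default" rule and picks SNAPPY arbitrarily for hot data.

```python
def recommend(long_keys_or_many_columns: bool, hot: bool = True) -> dict:
    """Map the guidelines above to column family settings (illustrative sketch)."""
    return {
        # Prefix encoding only pays off with long keys or many columns;
        # of the encoders, the guideline suggests FAST_DIFF only.
        "DATA_BLOCK_ENCODING": "FAST_DIFF" if long_keys_or_many_columns else "NONE",
        # Cheap block compression for hot data, GZ for cold data.
        # SNAPPY is chosen here by assumption; LZO would fit the guideline too.
        "COMPRESSION": "SNAPPY" if hot else "GZ",
    }


if __name__ == "__main__":
    print(recommend(long_keys_or_many_columns=True, hot=True))
    print(recommend(long_keys_or_many_columns=False, hot=False))
```

The helper deliberately takes booleans rather than sizes, since the guidelines give no numeric threshold for "long" keys or "large" values.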

-- Lars

________________________________
 From: Nick Dimiduk <[EMAIL PROTECTED]>
To: hbase-dev <[EMAIL PROTECTED]>
Sent: Wednesday, September 11, 2013 12:10 PM
Subject: Documenting Guidance on compression and codecs
 

Do we have a consolidated resource with information and recommendations
about use of the above? For instance, I ran a simple test using
PerformanceEvaluation, examining just the size of data on disk for 1G of
input data. The matrix below has some surprising results:

+--------------------+--------------+
| MODIFIER           | SIZE (bytes) |
+--------------------+--------------+
| none               |   1108553612 |
+--------------------+--------------+
| compression:SNAPPY |    427335534 |
+--------------------+--------------+
| compression:LZO    |    270422088 |
+--------------------+--------------+
| compression:GZ     |    152899297 |
+--------------------+--------------+
| codec:PREFIX       |   1993910969 |
+--------------------+--------------+
| codec:DIFF         |   1960970083 |
+--------------------+--------------+
| codec:FAST_DIFF    |   1061374722 |
+--------------------+--------------+
| codec:PREFIX_TREE  |   1066586604 |
+--------------------+--------------+
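
To make the table easier to compare at a glance, here is a small Python sketch that expresses each size as a fraction of the uncompressed baseline. The numbers are copied from the table above; the names `SIZES` and `ratios` are only for illustration.

```python
# On-disk sizes in bytes, copied from the table above.
SIZES = {
    "none": 1108553612,
    "compression:SNAPPY": 427335534,
    "compression:LZO": 270422088,
    "compression:GZ": 152899297,
    "codec:PREFIX": 1993910969,
    "codec:DIFF": 1960970083,
    "codec:FAST_DIFF": 1061374722,
    "codec:PREFIX_TREE": 1066586604,
}

baseline = SIZES["none"]

# Ratio below 1.0 means the modifier saved space; above 1.0 means it grew the data.
ratios = {name: size / baseline for name, size in SIZES.items()}

if __name__ == "__main__":
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<20} {ratio:5.2f}x of uncompressed")
```

Sorted this way, the three block compressors come first, FAST_DIFF and PREFIX_TREE sit just under the baseline, and PREFIX and DIFF come out larger than the uncompressed data.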

Where does a wayward soul look for guidance on which combination of the
above to choose for their application?

Thanks,
Nick