Major compactions and OS cache

Over on http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html I saw
this bit:

"The most important factor is that HBase is not restarted frequently and that it
performs house keeping on a regular basis. These so called compactions rewrite
files as new data is added over time. All files in HDFS once written are
immutable (for all sorts of reasons). Because of that, data is written into
new files and as their number grows HBase compacts them into another set of
new, consolidated files. And here is the kicker: HDFS is smart enough to put
the data where it is needed!"

... and I always wondered what this does to the OS cache.  In some applications
(non-HBase stuff, say full-text search), the OS cache plays a crucial role in
how the system performs.  If you have to hit the disk too much, you're in
trouble, so one of the things you avoid is making big changes to index files on
disk in order to avoid invalidating data that's been nicely cached by the OS.

However, with HBase, and especially major compactions, what happens with the OS
cache?  All gone, right?
Do people find this problematic?
Or does the OS cache just not play such a significant role in systems running
HBase, because the data it holds and that needs to be accessed is much bigger
than the OS cache could ever be, so that even with the OS cache full and hot,
other data would still have to be read from disk anyway?
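
To put rough numbers on that last question, here is a back-of-the-envelope
sketch. It assumes reads are spread uniformly over the data, which is a
simplification; real workloads are usually skewed toward hot rows, so the
actual hit ratio would typically be higher. The function name and the node
sizes (16 GiB of OS cache, 2 TiB of data) are made up for illustration and
are not measurements from any cluster.

```python
def uniform_hit_ratio(cache_bytes: int, data_bytes: int) -> float:
    """Fraction of reads served from cache, assuming every byte of the
    data set is equally likely to be read (an assumption, not a fact
    about any real HBase workload)."""
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    # The cache can never serve more than 100% of reads.
    return min(1.0, cache_bytes / data_bytes)


GiB = 1024 ** 3
TiB = 1024 ** 4

# Hypothetical node: 16 GiB of RAM left over for the OS cache,
# 2 TiB of HBase data on local disks.
ratio = uniform_hit_ratio(16 * GiB, 2 * TiB)
print(f"{ratio:.2%}")  # prints 0.78%
```

Under that assumption fewer than 1 in 100 reads would come from the OS cache
even when it is fully warm, so losing it to a major compaction would hardly
matter. With a strongly skewed access pattern the picture could be quite
different.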