I've looked into this in the past, though I haven't implemented anything yet.
A couple of notes:
1) From what I can tell, HBase doesn't currently provide an API you could
use to figure this out smartly. (I was looking at 0.90.x; this may have
changed in later versions.)
2) What seemed to me like a good approach was a combination of oldest
modified time and number of store files: write a script that iterates over
all the regions in HDFS, chooses the region (or up to N regions) with
either the most store files or the files with the oldest modified
timestamp, and major-compacts those.
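The selection logic for that script could be sketched roughly as below. This is just one way to rank regions; the stats themselves would have to be gathered by walking the region directories in HDFS (e.g. parsing `hdfs dfs -ls -R` output), and all the names here are hypothetical:

```python
from collections import namedtuple

# Hypothetical per-region stats, as would be gathered by listing each
# region's store-file directories in HDFS.
RegionStats = namedtuple("RegionStats", ["name", "num_store_files", "oldest_mtime"])

def pick_regions(stats, n=1):
    """Pick up to n regions to major-compact: rank by store-file count
    (descending), breaking ties with the oldest modification time.
    This is one possible scoring, not a canonical one."""
    ranked = sorted(stats, key=lambda r: (-r.num_store_files, r.oldest_mtime))
    return [r.name for r in ranked[:n]]

regions = [
    RegionStats("region-a", 3, 1355300000),
    RegionStats("region-b", 7, 1355350000),
    RegionStats("region-c", 7, 1355200000),
]
print(pick_regions(regions, n=2))  # -> ['region-c', 'region-b']
```

With the candidates in hand, the script would issue one `major_compact` per chosen region (via the hbase shell or the Java admin API) and stop there for that run.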
3) At the end of the day, our servers weren't utilizing anywhere near 100%
of disk and CPU, so we decided to just major-compact everything each night.
We staggered the compactions over a couple of hours so as not to overwhelm
the cluster, though I'm not sure that has much effect, since the
compactions run serially in a single thread anyway.
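The staggering amounts to spreading the compaction start times evenly over a window. A minimal sketch of that scheduling (the table names and window size are made up; triggering each compaction would go through something like `echo "major_compact 'table'" | hbase shell`):

```python
def build_schedule(tables, window_seconds=2 * 3600):
    """Spread one major_compact per table evenly across the window,
    returning (table, start_offset_seconds) pairs."""
    step = window_seconds // len(tables)
    return [(table, i * step) for i, table in enumerate(tables)]

# Example: three tables spread over a one-hour window.
for table, offset in build_schedule(["table_a", "table_b", "table_c"], 3600):
    print("t+%4ds: major_compact '%s'" % (offset, table))
```

A cron-driven wrapper would sleep until each offset and then pipe the `major_compact` command into the hbase shell; since the compactions execute serially on the server side anyway, the stagger mostly just smooths out when they are queued.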
On Wed, Dec 12, 2012 at 3:19 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:
> If you want to do major compaction on a single region at a time (to
> minimize the impact on the cluster), how do you pick which region to
> compact?
> What should one look for in order to get the best ROI out of major
> compaction - the best ratio of positive benefit to negative impact
> - and is there a programmatic way to get to this information, so region
> selection+compaction can be automated?
> HBASE Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html