-
Smart Managed Major Compactions
Bryan Beaudreault 2012-07-18, 17:26
Hello all,
Before I start, I'm running cdh3u2, so HBase 0.90.4.
I am looking into managing major compactions ourselves, but there don't appear to be any mechanisms I can hook into to determine which tables need compacting. Ideally, each time my cron job runs it would compact the table that has gone longest since its last compaction, but I can't find a way to access this metric.
The default major compaction algorithm seems to be able to get the oldest modified time for all store files for a region to determine when it was last major compacted. I know this is not ideal, but it seems good enough. Unfortunately I don't see an easy way to get this.
Alternatively I can keep my own compaction log, but I'd rather not have to do that if there is another way. Is there some easy way to access this value that I am not seeing? I know I could construct the paths to store files myself, but this seems less than ideal as well (i.e. might break when we upgrade, etc).
Thanks
-- Bryan Beaudreault
-
Re: Smart Managed Major Compactions
Stack 2012-07-19, 00:52
On Wed, Jul 18, 2012 at 7:26 PM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
> I am looking into managing major compactions ourselves, but there doesn't appear to be any mechanisms I can hook in to determine which tables need compacting. Ideally each time my cron job runs it would compact the table with the next longest time since compaction, but I can't find a way to access this metric.
I would suggest taking a region view rather than a table view.
Internally, we look at the HDFS modification time of the store files when we check whether to compact. If the files are older than the major compaction interval set for the particular column family, we do a major compaction.
Running an external script, you could look at each region in turn on occasion. Look at its files. Check their modification time (and perhaps how many files there are under the region's column family), and if they are older than whatever threshold you want, run a major compaction on the region.
Try to balance how many you'd have running at a time.
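As a rough illustration of the region-by-region check described above, here is a minimal Python sketch. It assumes you have already collected, for each region, the oldest modification time among its store files (as epoch seconds). The function name, its parameters, and the idea of a fixed batch cap are hypothetical choices for illustration, not anything HBase provides.

```python
import time


def pick_regions_to_compact(oldest_mtime_by_region, max_age_seconds,
                            max_batch, now=None):
    """Return up to max_batch region names, stalest first.

    oldest_mtime_by_region: dict of region name -> epoch seconds of the
    oldest store file in that region (hypothetical input format).
    A region qualifies when that oldest file is older than max_age_seconds.
    """
    if now is None:
        now = time.time()
    stale = [
        (mtime, region)
        for region, mtime in oldest_mtime_by_region.items()
        if now - mtime > max_age_seconds
    ]
    # Smallest mtime first, i.e. the region that has waited longest.
    stale.sort()
    # Cap the batch so only a few compactions are requested per run.
    return [region for _, region in stale[:max_batch]]
```

Each cron run could then request a major compaction for only the returned regions, for example with the HBase shell's major_compact command, which keeps the number of compactions in flight bounded.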
> The default major compaction algorithm seems to be able to get the oldest modified time for all store files for a region to determine when it was last major compacted. I know this is not ideal, but it seems good enough. Unfortunately I don't see an easy way to get this.
It's in the stats data structure for an HDFS file. When scripting, you could parse it from an HDFS listing.
St.Ack
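For illustration, parsing the modification times out of an HDFS listing might look like the Python sketch below. It assumes the text comes from a recursive listing such as `hadoop fs -lsr /hbase`, with eight whitespace-separated columns ending in date, time, and path, and that store files live at /hbase/<table>/<region>/<family>/<file>. That layout is an assumption and may change between HBase versions. The function name and the region_depth parameter are hypothetical.

```python
from datetime import datetime


def oldest_mtime_by_region(listing_text, region_depth=3):
    """Map each region directory to the oldest file mtime beneath it.

    region_depth is the number of path components that identify a region
    (3 for the assumed /hbase/<table>/<region> layout).
    """
    oldest = {}
    for line in listing_text.splitlines():
        fields = line.split(None, 7)
        # Regular files start with "-"; skip directories and odd lines.
        if len(fields) != 8 or not fields[0].startswith("-"):
            continue
        parts = fields[7].strip("/").split("/")
        # Need at least region components + family + file name.
        if len(parts) < region_depth + 2:
            continue
        # Skip dot-directories such as .logs or .tmp.
        if any(p.startswith(".") for p in parts[:region_depth + 1]):
            continue
        # The listing shows minute resolution, in the client's time zone.
        mtime = datetime.strptime(fields[5] + " " + fields[6],
                                  "%Y-%m-%d %H:%M")
        region = "/" + "/".join(parts[:region_depth])
        if region not in oldest or mtime < oldest[region]:
            oldest[region] = mtime
    return oldest
```

Because this relies on the on-disk layout, it carries the upgrade risk raised in the original question; reading the same modification times through the Hadoop FileSystem API would face the same layout assumption.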