Re: When does compaction actually occur?
Lars,

In response to your earlier email, I'm not completely sure whether or
not I'm using a raw scan. The scan is performed in a region server
coprocessor and is initialized as follows:

Scan scan = new Scan()
.setMaxVersions(1)
.setTimeRange(myMinTimeStamp, myMaxTimeStamp)
.setStartRow(myStartRow)
.setStopRow(myStopRow);
scan.setCaching(1000);

InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment())
.getRegion().getScanner(scan);

The scan is indeed being filtered to the range I provide (using
setTimeRange), but it will retrieve records much older than should be
allowed given the TTL.
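
For what it's worth, the loop that drains the scanner looks roughly
like the sketch below (the one-hour cutoff and the logging are just
for illustration, not my exact code); it's inside this loop that the
stale cells show up:

List<KeyValue> results = new ArrayList<KeyValue>();
long ttlCutoffMs = System.currentTimeMillis() - 3600 * 1000L; // illustrative 1-hour cutoff
boolean more;
do {
    results.clear();
    more = scanner.next(results);
    for (KeyValue kv : results) {
        // Cells older than the cutoff still come back, despite TTL => 3600
        if (kv.getTimestamp() < ttlCutoffMs) {
            System.out.println("stale cell: " + kv);
        }
    }
} while (more);
scanner.close();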

I have multiple tables set up in a similar fashion, but here is a
description of one of them:

{NAME => 'facts', FAMILIES => [{NAME => 'd', BLOOMFILTER => 'ROW',
COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '3600', MIN_VERSIONS
=> '1'}]}
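
For completeness, this is more or less how that table gets created
programmatically (sketched from memory, so treat the exact descriptor
calls as approximate rather than my literal setup):

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor table = new HTableDescriptor("facts");
HColumnDescriptor d = new HColumnDescriptor("d");
d.setBloomFilterType(StoreFile.BloomType.ROW);
d.setCompressionType(Compression.Algorithm.SNAPPY);
d.setMaxVersions(1);
d.setTimeToLive(3600);   // seconds
d.setMinVersions(1);     // keep at least one version around even past the TTL
table.addFamily(d);
admin.createTable(table);
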
I'm building an OLAP cube for this project and want to make sure the
data size doesn't grow through the roof. Whether or not data expires
after exactly one hour is not an absolute requirement for this use
case. But I want to know why the system is not behaving as I think I
configured it to behave.

Thanks!

--Tom

On Sun, Jun 3, 2012 at 2:57 AM, Lars George <[EMAIL PROTECTED]> wrote:
> What Amandeep says, and also keep in mind that with the current selection process HBase holds O(log N) files for N data. So say for 2GB region sizes you get 2-3 files. This means it is compacting files very "aggressively", and most of these compactions are "all files included" ones... which are then promoted to major compactions implicitly. That way your predicate deletes should take effect and you will only need scheduled major compactions every so often.
>
> Lars
>
> On Jun 2, 2012, at 1:04 AM, Amandeep Khurana wrote:
>
>> Tom,
>>
>> Old cells will get deleted as a part of the next major compaction, which is typically recommended to be done once a day, when the load on the system is at its lowest.
>>
>> FWIW… To have a TTL of 3600 take effect, you'll have to do a major compaction every hour, which is an expensive operation, especially at scale. Chances are that your I/O load will shoot up and latencies will spike for operations to HBase. Can you tell us why a TTL of 3600s is of interest? What are your access patterns?
>>
>> -Amandeep
>>
>>
>> On Friday, June 1, 2012 at 3:59 PM, Tom Brown wrote:
>>
>>> I have a table that holds rotating data. It has a TTL of 3600. For
>>> some reason, when I scan the table I still get cells that are much
>>> older than that TTL.
>>>
>>> I have tried issuing a compaction request via the web UI, but that
>>> didn't seem to do anything.
>>>
>>> Am I misunderstanding the data model used by HBase? Is there anything
>>> else I can check to verify the functionality of my integration?
>>>
>>> I am using HBase 0.92 with Hadoop 1.0.2.
>>>
>>> Thanks in advance!
>>>
>>> --Tom
>>
>