-
When does compaction actually occur?
Tom Brown 2012-06-01, 22:59
I have a table that holds rotating data. It has a TTL of 3600. For some reason, when I scan the table I still get old cells that are much older than that TTL.
I have tried issuing a compaction request via the web UI, but that didn't seem to do anything.
Am I misunderstanding the data model used by HBase? Is there anything else I can check to verify the functionality of my integration?
I am using HBase 0.92 with Hadoop 1.0.2.
Thanks in advance!
--Tom
-
Re: When does compaction actually occur?
Amandeep Khurana 2012-06-01, 23:04
Tom,
Old cells will get deleted as a part of the next major compaction, which is typically recommended to be done once a day, when the load on the system is at its lowest.
FWIW… To have a TTL of 3600 take effect, you'll have to do a major compaction every hour, which is an expensive operation, especially at scale. Chances are that your I/O load will shoot up and latencies will spike for operations to HBase. Can you tell us why a TTL of 3600s is of interest? What are your access patterns?
-Amandeep
-
Re: When does compaction actually occur?
lars hofhansl 2012-06-02, 12:42
A scan should *never* show you expired cells (unless you're doing a "raw" scan).
If cells haven't been collected yet, they'll be filtered by the scan. In any case, the expired cells are not returned by the scan. Can you tell us more details? The scan code, the timestamps you get, a describe of your column families, etc.
Thanks.
-- Lars
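Editor's note: the distinction Lars draws between cells that are still stored and cells that a scan returns can be sketched with a small self-contained model. This is not HBase code; the `Store` class and its methods are hypothetical stand-ins, and the expiry rule (`timestamp < now - ttl`) is a simplifying assumption. It only illustrates that, in this model, a normal scan returns the same cells before and after a major compaction, while a raw scan sees whatever is still stored.

```java
import java.util.ArrayList;
import java.util.List;

class TtlScanSketch {
    // Hypothetical stand-in for the cells stored in one column family; not HBase code.
    static final class Store {
        final long ttlMillis;
        final List<Long> cellTimestamps = new ArrayList<>();

        Store(long ttlMillis) {
            this.ttlMillis = ttlMillis;
        }

        void put(long timestamp) {
            cellTimestamps.add(timestamp);
        }

        // Assumed expiry rule for this sketch: older than (now - TTL) means expired.
        boolean isExpired(long timestamp, long now) {
            return timestamp < now - ttlMillis;
        }

        // A normal scan hides expired cells; a raw scan returns everything still stored.
        List<Long> scan(long now, boolean raw) {
            List<Long> out = new ArrayList<>();
            for (long ts : cellTimestamps) {
                if (raw || !isExpired(ts, now)) {
                    out.add(ts);
                }
            }
            return out;
        }

        // Models a major compaction: expired cells are physically dropped.
        void majorCompact(long now) {
            cellTimestamps.removeIf(ts -> isExpired(ts, now));
        }
    }

    public static void main(String[] args) {
        long now = 10L * 3600 * 1000;          // arbitrary "current time" in ms
        Store store = new Store(3600 * 1000L); // TTL of 3600 seconds
        store.put(now - 5L * 3600 * 1000);     // five hours old: expired
        store.put(now - 10L * 60 * 1000);      // ten minutes old: live

        System.out.println("normal scan before compaction: " + store.scan(now, false));
        System.out.println("raw scan before compaction:    " + store.scan(now, true));
        store.majorCompact(now);
        System.out.println("normal scan after compaction:  " + store.scan(now, false));
        System.out.println("raw scan after compaction:     " + store.scan(now, true));
    }
}
```

In this model the compaction changes only what the raw scan sees, which is why the timing of compactions should not matter for ordinary reads.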
-
Re: When does compaction actually occur?
Doug Meil 2012-06-02, 13:16
Related to "when does compaction actually occur?", although the original question was about the web UI, you might also want to see this: http://hbase.apache.org/book.html#regions.arch... for an overview of the compaction file-selection algorithm.
-
Re: When does compaction actually occur?
Lars George 2012-06-03, 08:57
What Amandeep says, and also keep in mind that with the current selection process HBase holds O(log N) files for N data. So, say, for 2GB region sizes you get 2-3 files. This means it is compacting files very "aggressively", and most of these compactions are "all files included" ones... which are then promoted to major compactions implicitly. That way your predicate deletes should be in effect, and you will need scheduled major compactions only every so often.
Lars
-
Re: When does compaction actually occur?
Tom Brown 2012-06-05, 21:37
Lars,
In response to your earlier email, I'm not completely sure whether or not I'm using a raw scan. The scan is performed in a region server coprocessor initialized as such:
Scan scan = new Scan()
    .setMaxVersions(1)
    .setTimeRange(myMinTimeStamp, myMaxTimeStamp)
    .setStartRow(myStartRow)
    .setStopRow(myStopRow);
scan.setCaching(1000);

InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment())
    .getRegion().getScanner(scan);
The scan is indeed being filtered to the range I provide (using setTimeRange), but it will retrieve records much older than should be allowed given the TTL.
I have multiple tables set up in a similar fashion, but here is a description of one of them:
{NAME => 'facts', FAMILIES => [{NAME => 'd', BLOOMFILTER => 'ROW', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '3600', MIN_VERSIONS => '1'}]}
I'm building an OLAP cube for this project and want to make sure the data size doesn't grow through the roof. Whether or not data expires after exactly one hour is not an absolute requirement for this use case. But I want to know why the system is not behaving as I think I configured it to behave.
Thanks!
--Tom
-
Re: When does compaction actually occur?
lars hofhansl 2012-06-06, 09:06
Hi Tom,
You have set MIN_VERSIONS to 1. That tells HBase that for this column family you want to keep at least 1 version of a cell around regardless of whether it expired (due to TTL) or not. I think if you remove that it will behave as you expect.
As a general rule a compaction will never influence visibility of data that was inserted before the compaction (except for RAW scans), and hence you should never need to ask when a compaction happens - unless you are running out of disk space.
-- Lars
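Editor's note: the interaction between TTL and MIN_VERSIONS that Lars describes can be sketched as a small self-contained model. This is a simplification under stated assumptions, not HBase's implementation: `visibleVersions` is a hypothetical helper, it looks at the versions of a single cell only, and it assumes a version is expired when its timestamp is older than `now - ttl`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MinVersionsSketch {
    /**
     * Returns the timestamps of the versions of ONE cell that stay visible.
     * Simplified model: unexpired versions are kept up to maxVersions, and
     * expired versions are kept only while fewer than minVersions are retained.
     */
    static List<Long> visibleVersions(List<Long> timestamps, long now,
                                      long ttlMillis, int minVersions, int maxVersions) {
        List<Long> newestFirst = new ArrayList<>(timestamps);
        newestFirst.sort(Collections.reverseOrder());

        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            boolean expired = ts < now - ttlMillis;
            if (!expired && kept.size() < maxVersions) {
                kept.add(ts);   // live version, within VERSIONS limit
            } else if (expired && kept.size() < minVersions) {
                kept.add(ts);   // expired, but retained to satisfy MIN_VERSIONS
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long now = 10L * 3600 * 1000;        // arbitrary "current time" in ms
        long ttl = 3600 * 1000L;             // TTL => '3600' (seconds) in ms
        List<Long> onlyOldVersion = Arrays.asList(now - 5L * 3600 * 1000);

        // MIN_VERSIONS => '1': the five-hour-old version is still returned.
        System.out.println(visibleVersions(onlyOldVersion, now, ttl, 1, 1));
        // MIN_VERSIONS => '0': the same version is hidden.
        System.out.println(visibleVersions(onlyOldVersion, now, ttl, 0, 1));
    }
}
```

Under this model, a cell whose only version is older than the TTL stays visible when MIN_VERSIONS is 1, which matches the behaviour Tom reported for the 'facts' table.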