To add some color... HBase stores the versions of a KeyValue next to each other (at least after a compaction).
If your queries typically request most of the versions of a KV, that works out nicely.
If, however, you typically query only the latest version or a specific version, then HBase will also load all the other versions of the KV that happen to be in the same block.
That can be pretty inefficient, up to the point where scanning would require loading a new block for each KV.
HBase currently does not have a good story for the latter scenario. Solutions include a custom compaction policy that separates data along date ranges. That way HBase can rule out entire HFiles if they do not fall into the requested time range of the query.
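To illustrate the pruning idea, here is a toy sketch in plain Python (not HBase code). It assumes each store file records the minimum and maximum cell timestamp it contains and that the query range is half-open; the names StoreFile and files_to_read are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for an HFile and its timestamp metadata."""
    name: str
    min_ts: int  # smallest cell timestamp in the file
    max_ts: int  # largest cell timestamp in the file


def files_to_read(files, query_min_ts, query_max_ts):
    """Keep only files whose timestamps overlap [query_min_ts, query_max_ts)."""
    return [
        f for f in files
        if f.min_ts < query_max_ts and f.max_ts >= query_min_ts
    ]


# If compaction kept one file per date range, a narrow query touches one file.
files = [
    StoreFile("day1", 0, 99),
    StoreFile("day2", 100, 199),
    StoreFile("day3", 200, 299),
]
print([f.name for f in files_to_read(files, 150, 200)])
```

If compaction instead merged everything into one file spanning 0 to 299, the same query could not skip anything, which is the point of separating data along date ranges.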
From: Vladimir Rodionov <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Friday, December 6, 2013 10:33 AM
Subject: RE: Practical Upper Limit on Number of Version Stored?
Both columns and timestamps are valid choices. Events have sources, and in my approach the source is in the rowkey and the time is in the timestamp.
In your approach you embed the time into the column qualifier.
It's easy to get the last N events in my approach using a "give first N key-values" type of Filter; in your approach you need the same type of filter.
TTL will expire old events in both cases.
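A toy model of the timestamp-as-version approach, in plain Python rather than the HBase client API: versions of one cell are keyed by event time, read newest first, and anything older than the TTL is dropped. The function name last_n_events and the exact expiry comparison are assumptions made for this sketch.

```python
def last_n_events(versions, n, now_ms=None, ttl_ms=None):
    """Return up to n (timestamp, value) pairs, newest first.

    versions: dict mapping event timestamp (ms) to event payload,
              standing in for the versions of a single cell.
    ttl_ms:   if given together with now_ms, versions older than
              the TTL are treated as expired and skipped.
    """
    live = list(versions.items())
    if now_ms is not None and ttl_ms is not None:
        live = [(ts, v) for ts, v in live if now_ms - ts <= ttl_ms]
    # Versions are returned newest first, so the first n are the last n events.
    live.sort(key=lambda pair: pair[0], reverse=True)
    return live[:n]


events = {1000: "login", 2000: "click", 3000: "logout"}
print(last_n_events(events, 2))
print(last_n_events(events, 2, now_ms=3500, ttl_ms=1000))
```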
>Suppose you have event A occurring at time X.
>Then you have event B occurring at time X2.
>Are they the same?
>Based on the OPs limited description A and B are not.
>So why store them as versions as if they were the same?
There are no such things as "same" events. Frankly speaking, I am not following you here.
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [EMAIL PROTECTED]
From: Michael Segel [[EMAIL PROTECTED]]
Sent: Friday, December 06, 2013 4:23 AM
To: [EMAIL PROTECTED]
Subject: Re: Practical Upper Limit on Number of Version Stored?
Just because you can do something doesn't mean it's a good idea.
From a design perspective it's not a good idea.
Ask yourself: why does versioning exist? What purpose does versioning serve in HBase?
From a design perspective you have to ask yourself what you are attempting to do.
Here the OP says:
"I guess I don't really understand why I wouldn't want to do this. For our use case we only really care about the user's last 50 to 200 events. We don't really care about deleting events explicitly. More than likely we would enable a TTL to get rid of events older than a certain time. "
So his goal is to get the last N events first.
Remember columns are in sort order.
So if you have Event-XXXX or XXXX-Event as your column identifier (name), where XXXX is (Epoch - timestamp) ...
You will have your events ordered last event first.
This not only achieves what the OP wants, but ... I seem to recall some people posting here about methods to only return N results from a row at a time?
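A minimal sketch of the reversed-timestamp qualifier idea in plain Python, not HBase client code. It assumes XXXX is computed as a fixed ceiling minus the event time and encoded as 8 big-endian bytes so that byte-wise sort order matches numeric order. MAX_TS, event_qualifier and first_n_columns are names invented for the example.

```python
MAX_TS = 2**63 - 1  # assumed ceiling (the value of Java's Long.MAX_VALUE)


def event_qualifier(event_ts_ms: int) -> bytes:
    """Build an 'XXXX-Event' style qualifier where XXXX = MAX_TS - timestamp."""
    reversed_ts = MAX_TS - event_ts_ms
    # Fixed-width big-endian bytes sort the same way the numbers do.
    return reversed_ts.to_bytes(8, "big") + b"-Event"


def first_n_columns(row: dict, n: int) -> list:
    """Return the values of the first n columns in sorted qualifier order."""
    return [row[q] for q in sorted(row)[:n]]


# One row per user; each event gets its own column.
row = {
    event_qualifier(1000): "login",
    event_qualifier(3000): "logout",
    event_qualifier(2000): "click",
}
print(first_n_columns(row, 2))
```

Because newer events get smaller qualifiers, reading the first N columns of the row yields the last N events.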
And here's the kicker...
From a design perspective...
Suppose you have event A occurring at time X.
Then you have event B occurring at time X2.
Are they the same?
Based on the OP's limited description, A and B are not.
So why store them as versions as if they were the same?
Versioning may make sense if we were talking about an RSVP to a function.
At time T, Bob may RSVP 'yes'.
At time T1, Bob may RSVP 'tentative'.
At time T2, Bob may RSVP 'no'.
Each version is describing the same object.
Does that make sense?
Good design is critical...
Just putting it out there. ;-)
On Dec 5, 2013, at 9:50 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
> Version is just a timestamp (event time) => naturally fits time-series (event) types of data.
> Besides this, events are immutable objects; if they are not, then they are not events.
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [EMAIL PROTECTED]
The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
michael_segel (AT) hotmail.com