HBase >> mail # user >> Using HBase timestamps as natural versioning

Re: Using HBase timestamps as natural versioning
Is your ID fixed length or variable length ?

If the length is fixed, you can specify ID/0 as the start row in scan.
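A minimal sketch of why a fixed-length ID makes ID/0 a usable start row (plain Java, illustrative key layout: an 8-byte ID followed by an 8-byte big-endian timestamp, compared the way HBase orders row keys):

```java
import java.nio.ByteBuffer;

public class RowKeys {
    // Composite row key: fixed-length ID (8 bytes here) followed by an
    // 8-byte big-endian timestamp, so keys sort first by ID, then by time.
    static byte[] rowKey(long id, long ts) {
        return ByteBuffer.allocate(16).putLong(id).putLong(ts).array();
    }

    // Unsigned lexicographic comparison, as HBase compares row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long id = 42L;
        byte[] start = rowKey(id, 0L);             // the "ID/0" start row
        byte[] t1 = rowKey(id, 1377866520000L);    // some event time
        byte[] next = rowKey(id + 1, 0L);          // first key of the next ID
        System.out.println(compare(start, t1) < 0); // true
        System.out.println(compare(t1, next) < 0);  // true
    }
}
```

With a variable-length ID there is no such clean boundary, because a shorter ID's timestamp bytes would interleave with a longer ID's trailing bytes in the sort order.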

On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <[EMAIL PROTECTED]> wrote:

> Was gone for a few days. Sorry for not getting back to this until now. And
> thanks for adding to the discussion!
> The time used in the timestamp is the "natural" time (in ms resolution), as
> far as it is known. I.e., in the end it is of course some machine time, but
> the trigger to choose it is typically some human interaction. So there is
> some natural time to the events that update a row's data.
> If timestamps happen to differ just by 1 ms, as unlikely as that may be,
> this would still be valid.
> And the timestamp is always set by the client (i.e. the app server) when
> performing an HBase put. So it's never the region server time or something
> slightly arbitrary.
> To recap: The data model (even before mapping to HBase) is essentially
> ID -> ( attribute -> ( time -> value ))
> (where ID is a composite key consisting of some natural elements and some
> surrogate part).
> An event is something like "at time t, attribute x of ID attained value
> z".
> Events may enter the system out of chronological order!
> Typical access patterns are:
> (R1) "Get me all attributes of ID at time t"
> (R2) "Get me a trail of attribute changes between time t0 and t1"
> (W1) "Set x=z on ID for time t"
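The three access patterns above can be sketched against the model itself. This is a plain-Java stand-in (a TreeMap playing the role of HBase's time-sorted versions; all names are illustrative), not the actual HBase mapping:

```java
import java.util.*;

public class VersionedStore {
    // ID -> (attribute -> (time -> value)), mirroring the model in the mail.
    private final Map<String, Map<String, TreeMap<Long, String>>> data = new HashMap<>();

    // W1: "Set x=z on ID for time t" -- out-of-order arrivals are fine,
    // the TreeMap keeps each attribute's history sorted by time.
    public void put(String id, String attr, long t, String value) {
        data.computeIfAbsent(id, k -> new HashMap<>())
            .computeIfAbsent(attr, k -> new TreeMap<>())
            .put(t, value);
    }

    // R1: "Get me all attributes of ID at time t" -- for each attribute,
    // take the latest value with timestamp <= t.
    public Map<String, String> getAt(String id, long t) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, TreeMap<Long, String>> e :
                data.getOrDefault(id, Map.of()).entrySet()) {
            Map.Entry<Long, String> v = e.getValue().floorEntry(t);
            if (v != null) result.put(e.getKey(), v.getValue());
        }
        return result;
    }

    // R2: "Get me a trail of attribute changes between time t0 and t1".
    public SortedMap<Long, String> trail(String id, String attr, long t0, long t1) {
        return data.getOrDefault(id, Map.of())
                   .getOrDefault(attr, new TreeMap<>())
                   .subMap(t0, true, t1, true);
    }

    public static void main(String[] args) {
        VersionedStore s = new VersionedStore();
        s.put("id1", "x", 100L, "a");
        s.put("id1", "x", 50L, "b");  // arrives out of order
        System.out.println(s.getAt("id1", 60L)); // x as of t=60 is "b"
    }
}
```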
> As said, currently we store data almost exactly the way I described the
> model above (and probably that's why I wrote it down the way I did), using
> the HBase timestamp to store the time dimension.
> Alternative: Adding the time dimension to the row key
> -----------
> That would mean: ID/time -> (attribute -> value)
> That would imply either keeping copies of all (later) attribute values in
> all (later) rows, or only putting deltas and scanning over rows to collect
> attribute values.
> Let's assume the latter (for better storage and write performance).
> Wouldn't that mean rebuilding what HBase does? Is there nothing HBase does
> more efficiently when performing R1, for example?
> I.e: Assume I want to get the latest state of row ID. In that case I would
> need to scan from ID/0 to ID/<now> (or reverse) to fish for all attribute
> values (assuming I don't know all expected attributes beforehand). Is that
> as efficient as an HBase get with max versions 1 and <now> as time stamp?
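The client-side merge that the delta-row alternative forces for R1 can be sketched like this: every delta row from ID/0 up to ID/<now> has to be read and folded together. A plain-Java illustration (a TreeMap keyed by time stands in for the scanned rows of one ID; names are illustrative):

```java
import java.util.*;

public class DeltaScan {
    // Alternative design: row key ID/time, each row holding only the
    // attributes that changed at that time (a delta). Reconstructing the
    // latest state of an ID means scanning all its delta rows and merging
    // them client-side -- work HBase's own version handling would
    // otherwise do per cell.
    static Map<String, String> latestState(TreeMap<Long, Map<String, String>> deltas,
                                           long now) {
        Map<String, String> state = new HashMap<>();
        // headMap(now, true) plays the role of a scan from ID/0 to ID/now.
        for (Map<String, String> delta : deltas.headMap(now, true).values()) {
            state.putAll(delta); // later rows overwrite earlier values
        }
        return state;
    }

    public static void main(String[] args) {
        TreeMap<Long, Map<String, String>> deltas = new TreeMap<>();
        deltas.put(10L, Map.of("x", "1", "y", "2"));
        deltas.put(20L, Map.of("x", "3"));
        System.out.println(latestState(deltas, 100L)); // x=3 and y=2, merged client-side
    }
}
```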
> Thanks,
> Henning
> On 08/21/2013 01:11 PM, Michael Segel wrote:
>> I would have to disagree with Lars on this one...
>> It's really a bad design.
>> To your point, your data is temporal in nature. That is to say, time is
>> an element of your data and it should be part of your schema.
>> You have to remember that time is relative.
>> When a row is entered in to HBase, which time is used in the timestamp?
>> The client(s)? The RS? Unless I am mistaken or the API has changed, you
>> can set any arbitrary long value as the timestamp for a given row/cell.
>> Like I said, it's relative.
>> Since your data is temporal, what is the difference if the event happened
>> at TS xxxxxxxx10 or xxxxxxxxx11 (the point is that the TS differs by 1 in
>> the least significant bit)?
>> You could be trying to reference the same event.
>> To Lars's point, if you make time part of your key, you could end up with
>> hot spots. It depends on your key design. If it's the least significant
>> portion of the key, it's less of an issue. (clientX | action | TS) would be
>> an example that would sort the data by client, by action type, then by
>> timestamp. (EPOCH - TS) would put the most current first.
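One common way to realize the "(EPOCH - TS)" idea is to subtract the timestamp from a fixed maximum, so newer events sort first under ascending key order. The constant and the zero-padded string layout below are illustrative choices, not anything prescribed by HBase:

```java
public class ReversedTs {
    // Reversed timestamp: subtracting from Long.MAX_VALUE makes newer events
    // sort first (one common reading of "EPOCH - TS"; the exact constant is
    // a design choice).
    static long reverse(long ts) {
        return Long.MAX_VALUE - ts;
    }

    // Illustrative (clientX | action | reversed TS) key; the reversed
    // timestamp is zero-padded to 19 digits so string order matches
    // numeric order.
    static String rowKey(String client, String action, long ts) {
        return client + "|" + action + "|" + String.format("%019d", reverse(ts));
    }

    public static void main(String[] args) {
        // The newer event (ts=200) sorts before the older one (ts=100).
        System.out.println(rowKey("clientX", "click", 200L)
                .compareTo(rowKey("clientX", "click", 100L)) < 0); // true
    }
}
```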
>> When you try to take a shortcut, it usually will bite you in the ass.
>> TANSTAAFL applies!
>> HTH
>> -Mike
>> On Aug 11, 2013, at 12:21 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>  If you want deletes to work correctly you should enable
>>> KEEP_DELETED_CELLS for your column families (I still think that should be