HBase user mailing list: Using HBase timestamps as natural versioning


Earlier messages in this thread (collapsed):
Henning Blohm 2013-08-10, 13:26
lars hofhansl 2013-08-11, 05:21
Henning Blohm 2013-08-11, 10:36
Henning Blohm 2013-08-30, 12:42
Re: Using HBase timestamps as natural versioning
Is your ID fixed length or variable length?

If the length is fixed, you can specify ID/0 as the start row in scan.
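
A minimal sketch of such a scan, assuming row keys of the form "<fixed-length ID>/<zero-padded ms timestamp>" and the 0.94-era HBase client API; the table name and key values below are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanOneId {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "entities");    // hypothetical table name

            // Row keys are assumed to look like "<fixed-length ID>/<zero-padded ms timestamp>".
            byte[] start = Bytes.toBytes("ID-0001/0");      // "ID/0" as suggested above
            byte[] stop = Bytes.toBytes("ID-0001/:");       // ':' sorts right after '9', so this
                                                            // exclusive stop row covers all timestamps
            ResultScanner scanner = table.getScanner(new Scan(start, stop));
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }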

Cheers
On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <[EMAIL PROTECTED]> wrote:

> Was gone for a few days. Sorry for not getting back to this until now. And
> thanks for adding to the discussion!
>
> The time used in the timestamp is the "natural" time (in ms resolution) as
> far as it is known. I.e. in the end it is of course some machine time, but
> the trigger to choose it is typically some human interaction. So there is
> some natural time to events that update a row's data.
> If timestamps happen to differ by just 1 ms, as unlikely as that may be,
> this would still be valid.
> And the timestamp is always set by the client (i.e. the app server) when
> performing an HBase put. So it's never the region server time or something
> slightly arbitrary.
>
> To recap: The data model (even before mapping to HBase) is essentially
>
> ID -> ( attribute -> ( time -> value ))
>
> (where ID is a composite key consisting of some natural elements and some
> surrogate part).
>
> An event is something like "at time t, attribute x of ID attained value
> z".
>
> Events may enter the system out of chronological order!
>
> Typical access patterns are:
>
> (R1) "Get me all attributes of ID at time t"
> (R2) "Get me a trails of attribute changes between time t0 and t1"
> (W1) "Set x=z on ID for time t"
>
> As said, we currently store data almost exactly the way I described the
> model above (and probably that's why I wrote it down the way I did), using
> the HBase timestamp to store the time dimension.
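
A minimal sketch of a write in that model (W1: "Set x=z on ID for time t"), with row key = ID, column qualifier = attribute, and the event time passed as the explicit cell timestamp; table, family, and value names are hypothetical and the 0.94-era client API is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "entities");    // hypothetical table name

            long t = 1377866520000L;                        // event time in ms (example value)
            Put put = new Put(Bytes.toBytes("ID-0001"));    // row key = ID
            put.add(Bytes.toBytes("a"),                     // column family (hypothetical)
                    Bytes.toBytes("x"),                     // qualifier = attribute name
                    t,                                      // explicit cell timestamp = event time
                    Bytes.toBytes("z"));                    // value
            table.put(put);
            table.close();
        }
    }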
>
>
> Alternative: Adding the time dimension to the row key
> -----------
>
> That would mean: ID/time -> (attribute -> value)
>
> That would imply either having copies of all (later) attribute values in
> all (later) rows, or putting only deltas and scanning over rows to collect
> attribute values.
>
> Let's assume the latter (for better storage and writing performance).
>
> Wouldn't that mean rebuilding what HBase does? Is there nothing HBase does
> more efficiently when performing R1, for example?
>
> I.e.: Assume I want to get the latest state of row ID. In that case I would
> need to scan from ID/0 to ID/<now> (or reverse) to fish for all attribute
> values (assuming I don't know all expected attributes beforehand). Is that
> as efficient as an HBase get with max versions 1 and <now> as time stamp?
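
A minimal sketch of the Get described in that question: at most one version per column, with <now> as the upper bound of the time range (which is exclusive in HBase, hence the +1). Table, family, and row key names are hypothetical; the 0.94-era client API is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LatestStateRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "entities");    // hypothetical table name
            long now = System.currentTimeMillis();

            // R1: newest version of every attribute of row ID, considering only
            // cells whose timestamp is <= now.
            Get get = new Get(Bytes.toBytes("ID-0001"));
            get.setMaxVersions(1);
            get.setTimeRange(0, now + 1);                   // upper bound is exclusive

            Result result = table.get(get);
            if (!result.isEmpty()) {
                for (KeyValue kv : result.raw()) {
                    System.out.println(Bytes.toString(kv.getQualifier()) + " = "
                            + Bytes.toString(kv.getValue()) + " @ " + kv.getTimestamp());
                }
            }
            table.close();
        }
    }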
>
> Thanks,
> Henning
>
>
>
> On 08/21/2013 01:11 PM, Michael Segel wrote:
>
>> I would have to disagree with Lars on this one...
>>
>> It's really a bad design.
>>
>> To your point, your data is temporal in nature. That is to say, time is
>> an element of your data and it should be part of your schema.
>>
>> You have to remember that time is relative.
>>
>> When a row is entered into HBase, which time is used in the timestamp?
>> The client(s)? The RS? Unless I am mistaken or the API has changed, you
>> can set any arbitrary long value to be the timestamp for a given
>> row/cell.
>> Like I said, it's relative.
>>
>> Since your data is temporal, what is the difference if the event happened
>> at TS xxxxxxxx10 or xxxxxxxx11 (the point is that the TS differs by 1 in
>> the least significant bit)?
>> You could be trying to reference the same event.
>>
>> To Lars' point, if you make time part of your key, you could end up with
>> hot spots. It depends on your key design. If it's the least significant
>> portion of the key, it's less of an issue. (clientX | action | TS) would be
>> an example that sorts the data by client, by action type, then by
>> timestamp. (EPOCH - TS) would put the most current first.
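
A minimal sketch of such a (clientX | action | TS) key with the newest entries sorting first; "(EPOCH - TS)" is implemented here as Long.MAX_VALUE - timestamp, one common way to reverse the sort order, and the client/action values are hypothetical and assumed fixed length:

    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseTimestampKey {
        // Builds a (clientX | action | TS) row key where the timestamp part is
        // stored as Long.MAX_VALUE - ts, so the newest rows for a given client
        // and action sort first. Assumes client and action are fixed length.
        static byte[] rowKey(String client, String action, long eventTimeMs) {
            byte[] reversedTs = Bytes.toBytes(Long.MAX_VALUE - eventTimeMs);
            return Bytes.add(Bytes.toBytes(client), Bytes.toBytes(action), reversedTs);
        }

        public static void main(String[] args) {
            byte[] key = rowKey("client-0001", "UPDATE", System.currentTimeMillis());
            System.out.println(Bytes.toStringBinary(key));
        }
    }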
>>
>> When you try to take a short cut, it usually will bite you in the ass.
>>
>> TANSTAAFL applies!
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 11, 2013, at 12:21 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>>> If you want deletes to work correctly, you should enable
>>> KEEP_DELETED_CELLS for your column families (I still think that should be
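
A minimal sketch of enabling the KEEP_DELETED_CELLS setting mentioned in the quoted advice above (together with a large max-versions setting) when creating a table; table and family names are hypothetical and the 0.94-era admin API is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateVersionedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HColumnDescriptor family = new HColumnDescriptor("a");  // hypothetical family name
            family.setMaxVersions(Integer.MAX_VALUE);               // keep the full version history
            family.setKeepDeletedCells(true);                       // keep deleted cells visible to
                                                                    // time-range / point-in-time reads
            HTableDescriptor table = new HTableDescriptor("entities"); // hypothetical table name
            table.addFamily(family);
            admin.createTable(table);
            admin.close();
        }
    }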
Later message in this thread (collapsed):
Henning Blohm 2013-08-31, 10:09