

Re: Why does a delete behave like this?
This is because, by default, a delete marker extends all the way back in time.
When you set KEEP_DELETED_CELLS for your column family, this behavior is fixed, i.e. you get correct time-range query behavior even with respect to deletes.
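
To make the difference concrete, here is a rough sketch in Python. It is a simplified model, not HBase code. The names Cell, DeleteMarker and get are hypothetical, and it assumes that a time range is half-open ([min, max)) and that a column delete marker masks every version at or before its own timestamp. With KEEP_DELETED_CELLS the model only applies markers whose timestamp falls inside the queried range, which is my reading of the documented behavior.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    column: str
    timestamp: int
    value: str


class DeleteMarker(NamedTuple):
    column: str
    timestamp: int  # masks every version of `column` at or before this time


def get(cells, markers, time_range, keep_deleted_cells=False):
    """Return the cells visible to a read restricted to [lo, hi).

    Simplified model (an assumption, not HBase internals): it ignores
    max-versions, TTL and compactions.
    """
    lo, hi = time_range
    if keep_deleted_cells:
        # Only markers that themselves fall inside the queried range apply.
        markers = [m for m in markers if lo <= m.timestamp < hi]
    visible = []
    for cell in cells:
        if not lo <= cell.timestamp < hi:
            continue
        # By default every marker applies, however late it was written.
        masked = any(
            m.column == cell.column and m.timestamp >= cell.timestamp
            for m in markers
        )
        if not masked:
            visible.append(cell)
    return visible


# Same data as in the question: three puts, then a delete at 4000.
cells = [
    Cell("cf:1", 1000, "One"),
    Cell("cf:2", 2000, "Two"),
    Cell("cf:3", 3000, "Three"),
]
markers = [DeleteMarker("cf:1", 4000)]

default_view = get(cells, markers, (0, 3500))
kept_view = get(cells, markers, (0, 3500), keep_deleted_cells=True)

print([c.value for c in default_view])  # ['Two', 'Three']
print([c.value for c in kept_view])     # ['One', 'Two', 'Three']
```

In the shell the setting would presumably be applied with something like alter 't1', {NAME => 'cf', KEEP_DELETED_CELLS => true}, but please check the exact syntax against your HBase version.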
-- Lars

________________________________
 From: Niels Basjes <[EMAIL PROTECTED]>
To: user <[EMAIL PROTECTED]>
Sent: Monday, December 9, 2013 12:47 AM
Subject: Why does a delete behave like this?
 

Hi,

When I first started learning about HBase I compared the logic of setting
new values to something that is similar to the way a tool like Subversion
works: When you set a new value you don't overwrite the old one, you simply
create a new version.
Just like Subversion, you can then retrieve the old value at a later moment
and that way reconstruct the situation at an earlier date.

(The only real variation to the SVN model is that HBase only retains the
last N versions of a cell.)

There is however one situation where this comparison really fails: When you
do a delete on a cell.
If you want to retrieve the state of a thing from Subversion and that thing
has been deleted in the current version, you can still get it back.
With HBase, however, deleting a cell places a tombstone at a specific time,
so internally the older values are still present.

But when you try to retrieve such an older value then you still get an
empty result back (i.e. no such cell).
The direct consequence of the currently implemented model is that an
application can never retrieve the correct state of a row at an older
timestamp if a delete on any cell has occurred.

Example:

I create a table with one row:

> create 't1', 'cf'
> put 't1', 'rowid', 'cf:1', 'One', 1000
> put 't1', 'rowid', 'cf:2', 'Two', 2000
> put 't1', 'rowid', 'cf:3', 'Three', 3000
> get 't1', 'rowid' , {TIMERANGE => [0,3500]}

    COLUMN                     CELL
     cf:1                      timestamp=1000, value=One
     cf:2                      timestamp=2000, value=Two
     cf:3                      timestamp=3000, value=Three
    3 row(s) in 0.0150 seconds

Then the delete of a cell at a later timestamp:

> delete 't1', 'rowid', 'cf:1', 4000

Now if I retrieve the row at time 3500, I would expect to still see the same
values as above.
This, however, is the reality:

> get 't1', 'rowid' , {TIMERANGE => [0,3500]}

    COLUMN                     CELL
     cf:2                      timestamp=2000, value=Two
     cf:3                      timestamp=3000, value=Three
    2 row(s) in 0.0120 seconds

Why has it been designed/implemented like this?
What is the logic behind this model?

--
Best regards / Met vriendelijke groeten,

Niels Basjes