HBase >> mail # dev >> Delete client API.


Re: Delete client API.
M.C., why would option B be superior to simply letting the native timestamps in HBase do what they were meant to do, and then storing your app-level logical timestamps in the cell itself along with the data? The (admittedly more correct) behavior you want is already the normal behavior when you're not setting application-defined timestamps.

In other words: HBase already has a timestamp that behaves as you describe, and only when you intentionally use it for another purpose does the behavior become non-intuitive. And, other things will become non-intuitive too, like replication.

In the FB messaging case, if I'm not mistaken, the official timestamp value is in use for something that isn't a timestamp at all (message ids, or something along those lines). So in that case, it would make sense that you'd want to also have another timestamp. I'm tempted to assert that that's an unusual use of the timestamp field, but then again, if the biggest use case of a product does something, it's hardly "unusual". :)

At the very least, since it would add overhead to every cell, this should be an opt-in behavior (the ability to say, "I'm setting my own timestamps, so HBase should also keep its own real timestamp"). But then again, what's the argument for doing that rather than storing the timestamps in your cell value? Is it the added abilities the API gives you around time ranges?
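To make the "store the timestamp in the cell value" alternative concrete, here is a minimal sketch. The 8-byte big-endian prefix and the helper names `encode_cell` / `decode_cell` are assumptions made for illustration only; nothing here is an HBase API, and the thread does not prescribe a layout.

```python
import struct

# Hypothetical layout: 8-byte big-endian signed logical timestamp, then the payload.
_HEADER = struct.Struct(">q")


def encode_cell(logical_ts: int, payload: bytes) -> bytes:
    """Prefix the payload with the application's logical timestamp."""
    return _HEADER.pack(logical_ts) + payload


def decode_cell(cell: bytes):
    """Split a stored value back into (logical_ts, payload)."""
    (logical_ts,) = _HEADER.unpack_from(cell)
    return logical_ts, cell[_HEADER.size:]


# The store's own timestamp stays server-assigned; the logical one rides in the value.
stored = encode_cell(6, b"val1")
logical_ts, payload = decode_cell(stored)
```

The trade-off raised above still applies: a timestamp kept inside the value is invisible to any time-range filtering done on the native timestamp.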

Ian

On Jan 18, 2012, at 1:51 AM, M. C. Srivas wrote:

On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

The memstoreTS is used for visibility during an intra-row transaction.
Are you proposing to do this only if the deletes/puts did not use the
current time?

The ability to define timestamps for all operations is crucial to HBase.
o It ensures that HTable.batch works correctly (which reorders Deletes
w.r.t. Puts at the Region Server).
o It ensures that replication works correctly.
o many other scenarios

If you do not use application-defined timestamps, the current time is used
and everything works as expected.
If you use application-defined timestamps, you are asking for a delete to
be either in the future or the past, and you have to understand what that
means.
Maybe we should document the behavior better.
I guess I am saying that I *do* understand the current "delete with TS"
behavior, and I find the current implementation unstable and
non-deterministic.  Documenting it more thoroughly does not make it less
quirky or more stable.  I propose fixing it along the lines suggested in
option B.  Karthik seems to agree.

-- Lars
----- Original Message -----
From: Karthik Ranganathan <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
Cc:
Sent: Tuesday, January 17, 2012 3:27 PM
Subject: Re: Delete client API.
@Srivas - totally agree that B is the correct thing to do.

One way we have talked about implementing this is using the memstore ts.
Every insert of a KV into the memstore is given a memstore-ts. These are
persisted only till they are needed (to ensure read atomicity for
scanners) and then that value is zeroed out on a subsequent compaction
(saves space). If we retained the memstore-ts even beyond these
compactions, we could get a deterministic order for the puts and deletes
(first insert ts < del ts < second insert ts).
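The idea of keeping a server-assigned sequence past compaction can be sketched with a toy model. Everything below is an assumption for illustration: the class `ToyRowWithSeq`, the per-row counter standing in for the memstore-ts, and the masking rule (a delete hides a put only if the put has a timestamp at or below the delete's and also arrived before it). It is not HBase code and not necessarily how the proposal would be implemented.

```python
import itertools


class ToyRowWithSeq:
    """Toy single-column row that keeps a server-side arrival sequence."""

    def __init__(self):
        self._seq = itertools.count(1)  # stand-in for a retained memstore-ts
        self.puts = []                  # (client_ts, seq, value)
        self.tombstones = []            # (client_ts, seq)

    def put(self, ts, value):
        self.puts.append((ts, next(self._seq), value))

    def delete_row(self, ts):
        self.tombstones.append((ts, next(self._seq)))

    def _masked(self, ts, seq):
        # Assumed rule: masked only by a delete that arrived AFTER the put.
        return any(ts <= d_ts and seq < d_seq for d_ts, d_seq in self.tombstones)

    def major_compact(self):
        # Drop masked puts, then the tombstones themselves.
        self.puts = [p for p in self.puts if not self._masked(p[0], p[1])]
        self.tombstones = []

    def read(self):
        live = [p for p in self.puts if not self._masked(p[0], p[1])]
        if not live:
            return None
        # Highest client timestamp wins; arrival order breaks ties.
        return max(live, key=lambda p: (p[0], p[1]))[2]


def run_sequence(compact_in_middle):
    row = ToyRowWithSeq()
    row.put(6, "val1")
    row.delete_row(6)
    if compact_in_middle:
        row.major_compact()
    row.put(6, "val2")
    return row.read()
```

In this model `run_sequence(False)` and `run_sequence(True)` both return `"val2"`, because a later put can never be masked by an earlier delete, so dropping the tombstone at compaction no longer changes what a read sees.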

Thanks
Karthik
On 1/17/12 2:14 PM, "M. C. Srivas" <[EMAIL PROTECTED]> wrote:

On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

Yeah, it's confusing if one expects it to work like in a relational
database.
You can even do worse. If you by accident place a delete in the future,
all current inserts will be hidden until the next major compaction. :)
I got confused about this myself just recently (see my mail on the
dev-list).
In the end this is a pretty powerful feature and core to how HBase works
(not saying that it is not confusing, though).
If one keeps the following two points in mind it makes more sense:
1. Delete just sets a tombstone marker at a specific TS (marking
everything older as deleted).
2. Everything is versioned; if no version is specified, the current time
(at the regionserver) is used.

In your example1 below t3 > 6, hence the insert is hidden.
In example2 both delete and insert TS are 6, hence the insert is hidden.
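The two points above reduce to a single comparison, sketched here under the assumption (taken from the examples in this thread) that a put whose timestamp is equal to or older than the tombstone's is hidden. The function name `is_hidden` and the clock value are hypothetical.

```python
def is_hidden(put_ts: int, tombstone_ts: int) -> bool:
    """A put is hidden when its timestamp is at or below the tombstone's."""
    return put_ts <= tombstone_ts


now = 1_000                      # hypothetical regionserver clock reading
future_delete_ts = now + 500     # a delete accidentally placed in the future

# A put stamped with the current time is hidden by the future delete.
hidden_by_future_delete = is_hidden(now, future_delete_ts)

# example2: delete and insert both carry TS=6, so the insert is hidden.
hidden_at_equal_ts = is_hidden(6, 6)

# A put with a strictly newer timestamp than the tombstone stays visible.
hidden_when_newer = is_hidden(7, 6)
```

Note that arrival order plays no part in this rule, which is why a delete in the future keeps hiding puts that arrive later.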
Let's consider my example2 for a little longer. Sequence of events:

 1.  ins  val1  with TS=6 set by client
 2.  del  entire row at TS=6 set by client
 3.  ins  val2  with TS=6  set by client
 4.  read row

The row returns nothing even though the insert at step 3 happened after
the delete at step 2. (Step 2 masks even future inserts.)

Now, the same sequence with a compaction thrown in the middle:

 1.  ins  val1  with TS=6 set by client
 2.  del  entire row at TS=6 set by client
 3.  ---- table is compacted -----
 4.  ins  val2  with TS=6  set by client
 5.  read row

The row returns val2.  (The delete at step 2 got lost due to compaction.)

So we have different results depending upon whether an internal
re-organization (like a compaction) happened or not. If we want both
sequences to behave exactly the same, then we need to first choose what is
the proper (and deterministic) behavior.
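The two sequences can be replayed against a toy model of the current behavior. This is a simplified sketch, not HBase code: the class `ToyRow`, the single-column row, and the rule that a compaction drops both masked puts and the tombstone are assumptions drawn from the description in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRow:
    """Toy single-column row using only client-settable timestamps."""
    puts: list = field(default_factory=list)        # (ts, value) in arrival order
    tombstones: list = field(default_factory=list)  # row-delete timestamps

    def put(self, ts, value):
        self.puts.append((ts, value))

    def delete_row(self, ts):
        # A delete only records a marker; nothing is physically removed yet.
        self.tombstones.append(ts)

    def _masked(self, ts):
        return any(ts <= d for d in self.tombstones)

    def major_compact(self):
        # Drop every masked put, then drop the tombstones themselves.
        self.puts = [(ts, v) for ts, v in self.puts if not self._masked(ts)]
        self.tombstones = []

    def read(self):
        live = [(ts, v) for ts, v in self.puts if not self._masked(ts)]
        if not live:
            return None
        best_ts = max(ts for ts, _ in live)
        # Among equal timestamps, the later arrival wins in this model.
        return [v for ts, v in live if ts == best_ts][-1]


def without_compaction():
    row = ToyRow()
    row.put(6, "val1")
    row.delete_row(6)
    row.put(6, "val2")
    return row.read()


def with_compaction():
    row = ToyRow()
    row.put(6, "val1")
    row.delete_row(6)
    row.major_compact()
    row.put(6, "val2")
    return row.read()
```

Here `without_compaction()` returns `None` while `with_compaction()` returns `"val2"`: the same client operations give different reads depending only on whether the internal compaction ran.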

A.  If we think that the first sequence is the correct one, then the
delete at step 2 needs to be preserved forever.

or,

B.  If we think that the second sequence is the correct behavior (i.e., a
read always produces the same results independent of compaction), then the
record needs a second "internal TS" field to allow the RS to distinguish
the real sequence of events, and not rely upon the TS field which is
settable by the client.

My opinion:

We should do B.  It is normal for someone to write code that says  "if old
exists, delete it;  add new". A subsequent read should always