I was going to post this yesterday, but real work got in the way...
I have to ask... why are you deleting anything from your columns?
The reason I ask is that you're syncing an object from an RDBMS to HBase. HBase allows a column that would be NULL simply not to exist; your RDBMS doesn't.
Your current logic takes a field that contains a NULL and then deletes the corresponding contents from HBase. Why?
No really, why?
You then have to make your DAO class more complex when reading from HBase, because you need to account for the fields that are NULL.
(In your use case, the DAO reads and writes against both NoSQL and RDBMS stores, so it's a consistency issue.)
If you just insert the NULL value in the column, you don't have that issue.
Do you waste disk space? Sure. But really, does that matter?
For your specific use case, you may actually want to track (via versioning) the values in that column, so that if there was a value and then it's gone, you can see its history.
(For auditing, at a minimum.)
I don't know, it's your application. The point is that you are making things more complex, and you should think about the alternative designs and the true cost differences.
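To make the suggestion concrete, a minimal sketch of the "write an explicit empty value instead of deleting" approach, using the 0.90-era HBase client API. The table name, column family, and qualifiers here are made up for illustration; this assumes a running cluster reachable from the default configuration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NullAsEmptyValue {
    public static void main(String[] args) throws Exception {
        // Hypothetical table mirroring an RDBMS row.
        HTable table = new HTable(HBaseConfiguration.create(), "mirror_table");

        Put put = new Put(Bytes.toBytes("row-1"));
        // Non-null column from the source RDBMS:
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        // NULL column: store an explicit empty value instead of issuing a
        // Delete, so the cell (and its version history) stays in HBase and
        // the reader sees every column it expects.
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("nickname"), new byte[0]);

        table.put(put);  // one Put, no companion Delete needed
        table.close();
    }
}
```

The read path then only has to interpret an empty value as NULL, rather than distinguishing missing columns from present ones.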
On Jul 5, 2012, at 1:28 PM, Ted Yu wrote:
> Take a look at HBASE-3584: Allow atomic put/delete in one call
> It is in 0.94, meaning it is not even in cdh4
> On Thu, Jul 5, 2012 at 11:19 AM, Keith Wyss <[EMAIL PROTECTED]> wrote:
>> My organization has been doing something zany to simulate atomic row
>> operations in HBase.
>> We have a converter-object model for the writables that are populated in
>> an HBase table, and one of the governing assumptions
>> is that if you are dealing with an Object record, you read all the columns
>> that compose it out of HBase or a different data source.
>> When we read lots of data in from a source system that we are trying to
>> mirror with HBase, if a column is null that means that whatever is
>> in HBase for that column is no longer valid. We have simulated what I
>> believe is now called an AtomicRowMutation by using a single Put
>> and populating it with blanks. The downside is the wasted space accrued by
>> the metadata for the blank columns.
>> Atomicity is not of utmost importance to us, but performance is. My
>> approach has been to create a Put and Delete object for a record and
>> populate the Delete with the null columns. Then we call
>> HTable.batch(List<Row>) on a bunch of these. It is my impression that this
>> shouldn't appreciably increase network traffic, as the RPC calls will be
>> batched.
>> Has anyone else addressed this problem? Does this seem like a reasonable
>> approach?
>> What sort of performance overhead should I expect?
>> Also, I've seen some Jira tickets about making this an atomic operation in
>> its own right. Is that something that
>> I can expect with CDH3U4?
>> Keith Wyss
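The split described in the quoted message can be sketched as plain Java: null columns in the mirrored record go into one list (destined for a Delete), non-null columns into another (destined for a Put), and in the real code the two would become one Delete and one Put for the same row key, submitted together via HTable.batch(List<Row>). The record shape here is a hypothetical stand-in for the converter-object model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowMutationSplit {
    // Partition a mirrored record's columns: columns whose source value is
    // NULL are slated for deletion, everything else for a Put. In the real
    // code the two lists would be turned into a Delete and a Put on the same
    // row key and handed to HTable.batch(List<Row>) together.
    public static Map<String, List<String>> split(Map<String, byte[]> record) {
        List<String> puts = new ArrayList<>();
        List<String> deletes = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : record.entrySet()) {
            (e.getValue() == null ? deletes : puts).add(e.getKey());
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("put", puts);
        out.put("delete", deletes);
        return out;
    }

    public static void main(String[] args) {
        Map<String, byte[]> record = new LinkedHashMap<>();
        record.put("name", "Alice".getBytes());
        record.put("nickname", null); // NULL in the source RDBMS
        System.out.println(split(record)); // prints {put=[name], delete=[nickname]}
    }
}
```

Since batch() groups the operations per region server RPC, pairing each Put with its Delete this way should not add extra round trips.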