
Re: Mixing Puts and Deletes in a single RPC
Well then, I guess if you really want to save space...

Take your DAO and add a method that takes the fields, writes them to a JSON string, and converts that string to a byte array.  (Or Avro?)
If a field is null, you just don't add it to the output string.
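
As a rough sketch of that DAO method (not tested against any real schema): the class name `SparseRecordCodec` and the field names are made up for illustration, and every value is assumed to be a string so the example can get by without a JSON library. In practice you would more likely hand this to a JSON or Avro library.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: serializes a record's fields to JSON, skipping nulls.
public class SparseRecordCodec {

    // Escape the characters that must be escaped inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a flat JSON object; a field whose value is null is simply left out.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() == null) continue; // null field: write nothing at all
            if (!first) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    // The byte array that would go into the single HBase column.
    static byte[] toBytes(Map<String, String> fields) {
        return toJson(fields).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "42");
        fields.put("middle_name", null); // null in the source system
        fields.put("city", "Chicago");
        System.out.println(toJson(fields)); // prints {"id":"42","city":"Chicago"}
    }
}
```

Because the whole record is rewritten on every sync, a field that has gone null in the source just disappears from the next version of the blob.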

That would take care of all of your overhead issues with the metadata.

Problem solved...

Now you don't have to worry about your metadata storage space, since everything is in a single byte array in a single column.
Of course this doesn't help if you want to filter your queries based on individual values.
(I guess you could write a filter that takes a JSON string, the field name, a value, and a comparison type as input and then outputs true/false. It would run server side and would have to be placed on every node, but it would work.)

Or you could parse out those fields that you need to help identify the record in terms of filtering, and store them in separate columns.  So you store the main record in a column, and then individual fields....
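
To make that layout concrete, here is a small sketch that uses a plain map in place of an actual HBase Put. The names `RecordLayout`, `toColumns`, and the `rec` qualifier are assumptions for illustration only; each map entry stands for one column written in the same row.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical layout helper: one column for the whole record,
// plus one column per field you want to filter on.
public class RecordLayout {

    // Assumed qualifier for the serialized record; a field with this
    // same name would collide, so pick something your schema never uses.
    static final String RECORD_QUALIFIER = "rec";

    static Map<String, byte[]> toColumns(byte[] recordBlob,
                                         Map<String, String> fields,
                                         Set<String> filterable) {
        Map<String, byte[]> columns = new LinkedHashMap<>();
        columns.put(RECORD_QUALIFIER, recordBlob); // the main record
        for (String name : filterable) {
            String value = fields.get(name);
            // A null filter field gets no column in this write.
            if (value != null) {
                columns.put(name, value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return columns;
    }
}
```

Only the handful of filterable fields pay the per-KeyValue overhead; everything else rides along inside the blob.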

Again, no deletes necessary and of course not a lot of additional overhead.

Without really understanding your use case and how you use the data, it's hard to determine what's optimal.

The approach in my first post, while not optimal in terms of storage, is pretty straightforward, simple to implement, and faster in terms of I/O access.


On Jul 6, 2012, at 10:42 AM, Keith Wyss wrote:

> Hi Michael,
> Thank you for your reply.
> I will answer your questions one by one.
> ---I have to ask... why are you deleting anything from your columns?
> In the event that there is a null value in a source system, we have to
> reflect that somehow and assume that there could be a value in HBase and
> that the source system has recently dropped the value. The options are a
> Put or a Delete. Delete is preferable because it reduces disk space.
> ---Your current logic is to take a field that contains a NULL and then
> delete the contents from HBase. Why?
> No really, Why?
> Our current logic is to write a masking KeyValue containing nothing in the
> value that makes the most recent value empty. Unfortunately this takes up
> a quarter of our space in the grid. If truly deleting can reacquire 25% of
> the HBase space, that sounds awesome, and should speed up reads with fewer
> KeyValues.
> ---You then have to make your DAO object class more complex when reading
> from HBase because you need to account for the fields that are NULL.
> This is a non-issue for us. It already accounts for nulls.
> ----If you just insert the NULL value in the column, you don't have that
> issue.
> Do you waste disk space? Sure. But really, does that matter?
> Yes. The accompanying metadata eats up a quarter of our space.
> ----For your specific use case, you may actually want to track
> (versioning) the values in that column. So that if there was a value and
> then it's gone, you're going to want to know its history.
> I think this is the most applicable downside to propagating deletes
> instead of puts.
> Storing a null in a row store makes a lot of sense to me. You eat a fixed
> amount of space depending on the size of the row. In a sparse column store
> like HBase, the frame and metadata are a real overhead. There's definitely
> a tradeoff between this storage and the benefits of simply treating HBase
> like a row store, and that's why I am curious if other engineers have
> addressed this and are willing to weigh in.
> Thanks for your suggestions. I think the point about versioning is very
> good and I will think long about that.
> Keith
> On 7/6/12 7:51 AM, "Michael Segel" <[EMAIL PROTECTED]> wrote:
>> I was going to post this yesterday, but real work got in the way...
>> I have to ask... why are you deleting anything from your columns?
>> The reason I ask is that you're sync'ing an object from an RDBMS to
>> HBase. While HBase allows fields that contain NULL not to exist, your
>> RDBMS doesn't.