HBase, mail # user - Re: Get on a row with multiple columns


Re: Get on a row with multiple columns
Ted 2013-02-09, 06:29
How often do you need to perform such delete operations?

Is there a way to use TTL so that you can avoid deletions?

Pardon me for not knowing your use case very well.

On Feb 8, 2013, at 10:16 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> Using HBase 0.94.3. I tried that too, but ran into performance issues from
> having to retrieve the entire row first (this got slow when one particular
> row was hammered), since a row can be big (a few MB, sometimes tens of MB),
> then finding the columns, and then doing a delete.
>
> To me, it looks like the current implementation of deleteColumn is
> suboptimal because of the 300 gets vs doing 1.
>
> Thanks
> Varun
>
> On Fri, Feb 8, 2013 at 10:09 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> Which HBase version are you using ?
>>
>> Is there a way to place 10 delete markers from the application side instead
>> of 300?
>>
>> Thanks
>>
>> On Fri, Feb 8, 2013 at 10:05 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>>
>>> We are given a set of 300 columns to delete. I tested two cases:
>>>
>>> 1) deleteColumns() - with the 's'
>>>
>>> This function simply adds delete markers for all 300 columns. In our case,
>>> typically only a fraction of these columns are actually present (about 10).
>>> After starting to use deleteColumns, we started seeing a drop in cluster-wide
>>> random read performance: 90th percentile latency worsened, and so did the
>>> 99th, probably because of having to traverse delete markers. I attribute this
>>> to the profusion of delete markers in the cluster. Major compactions slowed
>>> down by almost 50 percent, probably because of having to clean out
>>> significantly more delete markers.
>>>
>>> 2) deleteColumn()
>>>
>>> Ended up with intolerable 15-second calls, which clogged all the handlers,
>>> making the cluster pretty much unresponsive.
>>>
>>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>
>>>> For the 300 column deletes, can you show us how the Delete(s) are
>>>> constructed ?
>>>>
>>>> Do you use this method ?
>>>>
>>>>  public Delete deleteColumns(byte [] family, byte [] qualifier) {
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> So a Get call with multiple columns on a single row should be much faster
>>>>> than independent Get(s) on each of those columns for that row. I am
>>>>> seeing severely poor performance (~15 seconds) for certain
>>>>> deleteColumn() calls, and I see that there is a
>>>>> prepareDeleteTimestamps() function in HRegion.java which first tries to
>>>>> locate each column by doing an individual get on every column you want to
>>>>> delete (I am doing 300 column deletes). I think this should ideally be
>>>>> 1 get call with the batch of 300 columns, so that one scan can retrieve the
>>>>> columns, and only the columns that are found are actually deleted.
>>>>>
>>>>> Before I try this fix, I wanted to get an opinion on whether batching the
>>>>> get() will make a difference. From your answer, it seems it should.
>>>>>
>>>>> On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Everything is stored as a KeyValue in HBase.
>>>>>> The Key part of a KeyValue contains the row key, column family, column
>>>>>> name, and timestamp, in that order.
>>>>>> Each column family has its own store and store files.
>>>>>>
>>>>>> So in a nutshell, a get is executed by starting a scan at the row key
>>>>>> (which is a prefix of the key) in each store (CF) and then scanning forward
>>>>>> in each store until the next row key is reached. (In reality it is a bit
>>>>>> more complicated due to multiple versions, skipping columns, etc.)
>>>>>>
>>>>>>
>>>>>> -- Lars
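
The key ordering and the "seek to the row, then scan forward" behaviour described above can be modeled with a sorted set. This is a minimal sketch, not HBase's implementation: `KeyOrderSketch`, `Key`, and `getRow` are hypothetical names, the set stands in for a single store (so the column family is left out of the key), and timestamps are ordered newest first, as HBase does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class KeyOrderSketch {
    // Hypothetical stand-in for the key part of a KeyValue within one store.
    static final class Key {
        final String row;
        final String qualifier;
        final long timestamp;

        Key(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Sort by row key, then column name, then timestamp (newest first).
    static final Comparator<Key> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Simplified "get": seek to the first possible key of the row,
    // then scan forward until a key from the next row is reached.
    static List<Key> getRow(TreeSet<Key> store, String row) {
        List<Key> result = new ArrayList<>();
        Key seek = new Key(row, "", Long.MAX_VALUE); // sorts before every real key of this row
        for (Key k : store.tailSet(seek, true)) {
            if (!k.row.equals(row)) {
                break; // next row key reached, stop scanning
            }
            result.add(k);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<Key> store = new TreeSet<>(ORDER);
        store.add(new Key("row1", "a", 1L));
        store.add(new Key("row2", "b", 7L));
        store.add(new Key("row2", "a", 5L));
        store.add(new Key("row2", "a", 9L));
        store.add(new Key("row3", "a", 2L));

        for (Key k : getRow(store, "row2")) {
            System.out.println(k.row + "/" + k.qualifier + "@" + k.timestamp);
        }
    }
}
```

Because the row key is a prefix of the full key, all cells of a row sit next to each other, which is why one forward scan can serve a get for many columns of the same row.
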
>>>>>> ________________________________
>>>>>> From: Varun Sharma <[EMAIL PROTECTED]>
>>>>>> To: [EMAIL PROTECTED]
>>>>>> Sent: Friday, February 8, 2013 9:22 PM
>>>>>> Subject: Re: Get on a row with multiple columns