HBase >> mail # dev >> Sporadic memstore slowness for Read Heavy workloads


Re: Sporadic memstore slowness for Read Heavy workloads
I see you figured it out. I should have read all the email before I sent my last reply.

________________________________
 From: Varun Sharma <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
Sent: Tuesday, January 28, 2014 9:43 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 
OK, I think I understand this better now. So at step #3, the order will actually be something like this:

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered

(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered
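
The ordering listed above can be reproduced with a small sketch. This is a simplified model, not the actual KeyValue comparator: it assumes a single column family, represents the family delete marker as a cell with an empty qualifier, and sorts by row, then qualifier ascending, then timestamp descending. The names `Cell` and `sort_key` are made up for illustration.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    row: str
    qualifier: str          # empty string stands in for the family delete marker
    timestamp: int
    is_delete_family: bool = False

def sort_key(cell: Cell):
    # Row ascending, qualifier ascending (the empty qualifier of the
    # delete marker sorts first), then timestamp descending (newest first).
    return (cell.row, cell.qualifier, -cell.timestamp)

# Cells in the order they were written: T=1 puts, T=2 delete, T=3 puts.
cells = [
    Cell("ROW", "COL1", 1), Cell("ROW", "COL2", 1), Cell("ROW", "COL3", 1),
    Cell("ROW", "", 2, is_delete_family=True),
    Cell("ROW", "COL1", 3), Cell("ROW", "COL2", 3), Cell("ROW", "COL3", 3),
]

ordered = sorted(cells, key=sort_key)
for c in ordered:
    label = "<DELETE>" if c.is_delete_family else c.qualifier
    print(f"({c.row}, {label}, T={c.timestamp})")
```

Under these assumptions the delete marker comes first, and each column's T=3 cell sits directly before its T=1 cell.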

The ScanDeleteTracker class would simply filter out columns which have a timestamp < 2.
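
The filtering step can be sketched roughly as follows. This is not the real ScanDeleteTracker, only a toy model of the idea: the scan sees the delete marker before any put in the row, remembers its timestamp, and drops older cells. The function `visible_cells` and the use of `None` as the marker's qualifier are assumptions for illustration. The sketch masks cells with a timestamp at or below the marker's, which is my understanding of delete-family semantics; for this example it gives the same result as `< 2`.

```python
def visible_cells(sorted_cells):
    """Yield the puts that survive a family-wide delete marker.

    Each cell is (row, qualifier, timestamp); qualifier None marks the
    delete marker. The input must already be in scan order, with the
    marker ahead of the puts of its row.
    """
    delete_ts = {}  # row -> newest delete-marker timestamp seen so far
    for row, qualifier, ts in sorted_cells:
        if qualifier is None:
            # Remember the marker; it is never returned to the client.
            delete_ts[row] = max(ts, delete_ts.get(row, ts))
            continue
        if row in delete_ts and ts <= delete_ts[row]:
            continue  # masked by the delete marker
        yield (row, qualifier, ts)

scan_order = [
    ("ROW", None, 2),
    ("ROW", "COL1", 3), ("ROW", "COL1", 1),
    ("ROW", "COL2", 3), ("ROW", "COL2", 1),
    ("ROW", "COL3", 3), ("ROW", "COL3", 1),
]

survivors = list(visible_cells(scan_order))
print(survivors)
```

Only the three T=3 cells remain; the T=1 cells are the ones marked "filtered" in the list above.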

Varun

On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:

>Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) because COL2 > COL1 lexicographically. However, in the above example, it comes before the delete marker and hence before (ROW, COL1, T=1), which is wrong, no?
>
>
>
>On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>>bq. Now, clearly there will be columns above the delete marker which are
>>smaller than the ones below it.
>>
>>This is where a closer look is needed. Part of the confusion arises from
>>the usage of > and < in your example.
>>(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>>Here, in terms of sort order, 'above' means before. 'below it' would mean
>>after. So 'smaller' would mean before.
>>
>>Cheers
>>
>>
>>
>>On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Ted,
>>>
>>> I am not satisfied with your answer; the document you sent does not talk about
>>> Delete ColumnFamily marker sort order. For the delete family marker to
>>> work, it has to mask *all* columns of a family. Hence it has to be above
>>> all the older columns. All the new columns must come above this column
>>> family delete marker. Now, clearly there will be columns above the delete
>>> marker which are smaller than the ones below it.
>>>
>>> The document says nothing about delete marker order. Could you answer the
>>> question by looking through the example?
>>>
>>> Varun
>>>
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>>
>>> > Varun:
>>> > Take a look at http://hbase.apache.org/book.html#dm.sort
>>> >
>>> > There's no contradiction.
>>> >
>>> > Cheers
>>> >
>>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>>> >
>>> > > Actually, I now have another question because of the way our workload is
>>> > > structured. We use a wide schema and each time we write, we delete the
>>> > > entire row and write a fresh set of columns - we want to make sure no old
>>> > > columns survive. So, I just want to see if my picture of the memstore at
>>> > > this point is correct or not. My understanding is that the memstore is
>>> > > basically a skip list of KeyValues, ordered using the KeyValue
>>> > > comparator.
>>> > >
>>> > > 1) *T=1*, we write 3 columns for "ROW". So the memstore has:
>>> > >
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 2) *T=2*, now we write a delete marker for the entire ROW at T=2. My
>>> > > understanding is that we do not delete in the memstore but only add
>>> > > markers, so the memstore has:
>>> > >
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted above
>>> > > the delete.
>>> > >
>>> > > (ROW, COL1, T=3)
>>> > > (ROW, COL2, T=3)
>>> > > (ROW, COL3, T=3)
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > This is the ideal scenario for the data to be correctly reflected.
>>> > >
>>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and