Accumulo, mail # user - What priority for purge filter


Re: What priority for purge filter
Terry P. 2013-12-11, 20:22
Thanks Keith, wonderful explanation as always, and you are helping ensure
everything goes as expected. Thank you sir!
For minor compactions and partial major compactions, my approach to
"letting everything pass" is:
1. In the init() method (the boolean variable inCorrectScope is declared at
the head of the class and set to false to be safe):

IteratorScope is = env.getIteratorScope();
if (is.equals(IteratorScope.scan) || env.isFullMajorCompaction())
  inCorrectScope = true;
else
  inCorrectScope = false;
2. In the acceptRow() method:

while (rowIterator.hasTop()) {
  // If not in scan or full major compaction scope, short circuit and return true
  if (!inCorrectScope)
    return true;
  // <otherwise perform the steps to see if the row has the expTs column
  //  family and if the purge criteria is met or not from the value in that column>
}
My main question is just to confirm that I've put the return in the correct
place.
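
For reference, the two steps above can be sketched in a self-contained form. This is only an illustration under stated assumptions: Scope, Env, env() and PurgeScopeSketch are hypothetical stand-ins, not the real Accumulo classes, and the stand-in treats a call to isFullMajorCompaction() outside majc scope as an error. The sketch follows the order Keith describes below (confirm the scope is majc, then ask whether the compaction is full), which keeps a minor compaction from ever reaching that call.

```java
public class PurgeScopeSketch {
  // Stand-in for Accumulo's IteratorScope (hypothetical, illustration only).
  public enum Scope { scan, minc, majc }

  // Stand-in for the two IteratorEnvironment calls used in init().
  public interface Env {
    Scope getIteratorScope();
    boolean isFullMajorCompaction();
  }

  // Builds a stand-in environment. Assumption: asking about the major
  // compaction type outside majc scope is treated as an error here.
  public static Env env(final Scope scope, final boolean fullMajc) {
    return new Env() {
      public Scope getIteratorScope() { return scope; }
      public boolean isFullMajorCompaction() {
        if (scope != Scope.majc)
          throw new IllegalStateException("not a major compaction: " + scope);
        return fullMajc;
      }
    };
  }

  // True only when the iterator will see whole rows:
  // scans and full major compactions.
  public static boolean seesWholeRows(Env env) {
    Scope scope = env.getIteratorScope();
    if (scope == Scope.scan)
      return true;
    // Check the scope first, then ask whether the compaction is full.
    return scope == Scope.majc && env.isFullMajorCompaction();
  }

  // Mirrors the acceptRow() logic: outside the correct scope every row
  // passes; otherwise the purge criterion decides.
  public static boolean acceptRow(boolean inCorrectScope, boolean rowExpired) {
    if (!inCorrectScope)
      return true;
    return !rowExpired;
  }

  public static void main(String[] args) {
    for (Scope s : Scope.values())
      System.out.println(s + " -> " + seesWholeRows(env(s, s == Scope.majc)));
  }
}
```

Since inCorrectScope is fixed once in init(), the check in acceptRow() does not depend on the row contents, so it could equally sit before the while loop.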

Also, I saw something that surprised me with a scan too. I did a scan with
explicit columns listed, and NOT the expTimestamp column the purge iterator
operates on, and I still see entries. If I include the expTs column the
purge is done on in the explicit list of columns for the scan, entries are
filtered out as they should be.  In our environment and use case
for Accumulo, that shouldn't be an issue, but I can see how that might
confuse someone in other circumstances.  Just curious if there is some way
to "force" it to always run even if the "purge criterion column" is not
included in the scan columns.
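
One way to picture the behavior described above is the following self-contained sketch. It assumes the scan's column selection is applied before the purge iterator sees the row; that assumption is consistent with what I observed but is not confirmed in this thread. FetchedColumnsSketch, fetch() and the "expTs"/"data" column names are hypothetical and only stand in for the real iterator and table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FetchedColumnsSketch {
  // Hypothetical purge rule: a row is rejected only when an "expTs" value
  // is visible and is not later than `now`.
  public static boolean acceptRow(Map<String, String> visibleColumns, long now) {
    String expTs = visibleColumns.get("expTs");
    if (expTs == null)
      return true; // criterion column not visible: nothing to test, row passes
    return Long.parseLong(expTs) > now;
  }

  // Keeps only the requested columns, imitating a scan with explicit columns
  // (assumed to happen before the purge rule runs).
  public static Map<String, String> fetch(Map<String, String> row, Set<String> fetched) {
    Map<String, String> visible = new HashMap<String, String>(row);
    visible.keySet().retainAll(fetched);
    return visible;
  }
}
```

Under that assumption, an expired row fetched without its expTs column reaches acceptRow() with nothing to test and passes, while the same row fetched with expTs included is rejected.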

Thanks again as always for all the help.

Best regards,
Terry
On Mon, Dec 9, 2013 at 5:45 PM, Keith Turner <[EMAIL PROTECTED]> wrote:

>
>
>
>  On Mon, Dec 9, 2013 at 4:18 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>
>> Thanks Billie and Christopher, sounds like I should have the purge
>> iterator run after the VersioningIterator.
>>
>> Keith, uh oh, I was not aware that not all compactions will see the
>> entire row.  That sounds like it could be bad for my case!  Here is the
>> original thread that you helped me with as background:
>>
>
> Sometimes Accumulo will compact a subset of the data in a tablet.  This
> can happen during a minor compaction and when a major compaction is
> operating on a subset of files.  A row's columns and updates can be spread
> across multiple files.   In these cases you may only see a subset of the
> columns in a row.  Also you may not see the latest version.   Scans and
> full major compactions see all data.   You can tell the difference when an
> iterator is initialized.  An IteratorEnvironment is passed into the init
> method.   If the scope is majc and isFullMajorCompaction() is true then you
> know you will see all data (also if the scope is scan).  For minor
> compactions and partial major compactions you may want to just let
> everything pass.
>
>
>>
>>
>> http://mail-archives.apache.org/mod_mbox/accumulo-user/201311.mbox/%3CCAGUtCHryW3RR9PF5BAD+[EMAIL PROTECTED]%3E
>>
>> We only have 10-12 k/v pairs per row -- is that a factor? Can you explain
>> the nuances with respect to when a compaction won't see the entire row?
>>
>> Thanks,
>> Terry
>>
>>
>>
>> On Mon, Dec 9, 2013 at 1:34 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>
>>>
>>>
>>>
>>>  On Mon, Dec 9, 2013 at 12:02 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>>>
>>>> Greetings all,
>>>> With Accumulo v1.4.2, we have a purge filter/iterator that extends
>>>> RowFilter and I have a question about what priority it should be
>>>> implemented with. I see the default VersioningIterator runs at priority 20.
>>>>
>>>> Our purge iterator is designed to suppress (scan time) or remove (majc
>>>> or minc compactions) rows based on the value in a column. Is it more
>>>> efficient to run our purge iterator at a higher priority than the
>>>> VersioningIterator, or does it
>>>>
>>>
>>> Are you aware that not all compactions will see the entire row?
>>>
>>>
>>>>  really matter? Our VersioningIterator maxVersions is set to the