Accumulo, mail # user - Deleting many rows that match a given criterion


Re: Deleting many rows that match a given criterion
Terry P. 2013-11-01, 18:35
Thanks Mike. It looks like the AgeOffFilter class would be a good starting
point as a template for my filter: override the logic in the *init* method
as appropriate and put the criteria in the *accept* method.

What I can't figure out is where the magic to remove entries happens.  I don't
see anything in the AgeOffFilter class nor the base Filter class.  For my
case, I need to remove all entries for any rowkey whose expiration timestamp
column meets the criteria I'll be testing for. So really I'm removing the
whole row (all entries for a given rowkey), not just some entries.

Any chance you could share your code?  Thanks in advance for any help you
can provide.

Kind regards,
Terry
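
[For reference: there is no explicit delete call in Filter or AgeOffFilter.
An iterator removes data simply by not passing it through, and when that
happens during a full major compaction the suppressed entries are not written
to the rewritten files, so they are gone. With RowFilter, returning false from
acceptRow drops every entry in that row. A rough sketch, assuming a
hypothetical meta:expireAt column holding epoch-millis values — the class name
and column names below are illustrative, not from this thread:]

```java
// Sketch only: ExpiredRowFilter and the meta:expireAt column are made up
// for illustration; adjust to your schema. The "magic" is simply that
// entries not accepted here are never written to the compacted files.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.RowFilter;
import org.apache.hadoop.io.Text;

public class ExpiredRowFilter extends RowFilter {

  private static final Text FAM = new Text("meta");
  private static final Text QUAL = new Text("expireAt");

  private long cutoff;

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source,
      Map<String,String> options, IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    cutoff = System.currentTimeMillis(); // could also be passed in via options
  }

  @Override
  public boolean acceptRow(SortedKeyValueIterator<Key,Value> row)
      throws IOException {
    // Walk the row's entries looking for the expiration column.
    while (row.hasTop()) {
      Key k = row.getTopKey();
      if (k.compareColumnFamily(FAM) == 0
          && k.compareColumnQualifier(QUAL) == 0) {
        long expireAt = Long.parseLong(
            new String(row.getTopValue().get(), StandardCharsets.UTF_8));
        if (expireAt <= cutoff) {
          return false; // suppress the entire row, every column
        }
      }
      row.next();
    }
    return true; // no expired timestamp found; keep the row
  }
}
```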

On Thu, Oct 31, 2013 at 6:08 PM, Mike Drob <[EMAIL PROTECTED]> wrote:

> Terry,
>
> Yea, a RowFilter + full compaction takes care of the issue. Note that
> simply setting a RowFilter for scan time and expecting the data to delete
> naturally might not work if your clients set varying fetch columns on their
> scanners.
>
> Mike
>
>
> On Thu, Oct 31, 2013 at 5:11 PM, Terry P. <[EMAIL PROTECTED]> wrote:
>
>> Hi Mike,
>> Did you wind up writing java code to do this?  Did you go with a
>> RowFilter?
>>
>> I have a similar circumstance where I need to delete millions of rows
>> daily and the criteria for deletion is not in the rowkey.
>>
>> Thanks in advance,
>> Terry
>>
>>
>>
>> On Wed, Oct 23, 2013 at 4:21 PM, Mike Drob <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks for the feedback, Aru and Keith.
>>>
>>> I've had some more time to play around with this, and here are some
>>> additional observations.
>>>
>>> My existing process is very slow. I think this is due to each deletemany
>>> command starting up a new scanner and batchwriter, and creating a lot of
>>> rpc overhead. I didn't initially think that it would be a significant
>>> amount of data, but maybe I just had the wrong idea of what "significant"
>>> is in this case.
>>>
>>> I'm not sure the RowDeletingIterator would work in this case because I
>>> do use empty rows for other purposes. The RowFilter at compaction is a
>>> great option, except I had hoped to avoid writing actual Java code. Looking
>>> back at this, I might have to bite that bullet.
>>>
>>> Again, thanks both for the suggestions!
>>>
>>> Mike
>>>
>>>
>>> On Tue, Oct 22, 2013 at 12:04 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>>>
>>>> If it's a significant amount of data, you could create a class that
>>>> extends RowFilter and set it as a compaction iterator.
>>>>
>>>>
>>>> On Tue, Oct 22, 2013 at 11:45 AM, Mike Drob <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I'm attempting to delete all rows from a table that contain a specific
>>>>> word in the value of a specified column. My current process looks like:
>>>>>
>>>>> accumulo shell -e 'egrep .*EXPRESSION.* -np -t tab -c col' | awk
>>>>> 'BEGIN {print "table tab"}; {print "deletemany -f -np -r " $1}; END
>>>>> {print "exit"}' > rows.out
>>>>> accumulo shell -f rows.out
>>>>>
>>>>> I tried playing around with scan iterators and various options on
>>>>> deletemany and deleterows but wasn't able to find a more straightforward
>>>>> way to do this. Does anybody have any suggestions?
>>>>>
>>>>> Mike
>>>>>
>>>>
>>>>
>>>
>>
>
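
[Wiring such a compaction-time filter up — once the class is on the tablet
servers' classpath — can look roughly like the shell session below. Table
name, iterator name, priority, and class are illustrative: setiter with
-majc registers the iterator for major compactions, and compact -w forces a
full compaction and waits for it to finish.]

```
root@instance> setiter -t tab -majc -p 30 -name expireRows -class com.example.ExpiredRowFilter
root@instance> compact -t tab -w
```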