Accumulo, mail # user - Deleting many rows that match a given criterion


Re: Deleting many rows that match a given criterion
Terry P. 2013-10-31, 21:11
Hi Mike,
Did you wind up writing Java code to do this?  Did you go with a RowFilter?

I have a similar circumstance where I need to delete millions of rows daily
and the criterion for deletion is not in the row key.

Thanks in advance,
Terry

On Wed, Oct 23, 2013 at 4:21 PM, Mike Drob <[EMAIL PROTECTED]> wrote:

> Thanks for the feedback, Aru and Keith.
>
> I've had some more time to play around with this, and here are some
> additional observations.
>
> My existing process is very slow. I think this is due to each deletemany
> command starting up a new Scanner and BatchWriter, which creates a lot of
> RPC overhead. I didn't initially think that it would be a significant
> amount of data, but maybe I just had the wrong idea of what "significant"
> is in this case.
>
> I'm not sure the RowDeletingIterator would work in this case because I do
> use empty rows for other purposes. The RowFilter at compaction is a great
> option, except I had hoped to avoid writing actual Java code. Looking back
> at this, I might have to bite that bullet.
>
> Again, thanks both for the suggestions!
>
> Mike
>
>
> On Tue, Oct 22, 2013 at 12:04 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
>> If it's a significant amount of data, you could create a class that
>> extends RowFilter and set it as a compaction iterator.
>>
>>
>> On Tue, Oct 22, 2013 at 11:45 AM, Mike Drob <[EMAIL PROTECTED]> wrote:
>>
>>> I'm attempting to delete all rows from a table that contain a specific
>>> word in the value of a specified column. My current process looks like:
>>>
>>> accumulo shell -e 'egrep .*EXPRESSION.* -np -t tab -c col' \
>>>   | awk 'BEGIN {print "table tab"}; {print "deletemany -f -np -r " $1}; END {print "exit"}' \
>>>   > rows.out
>>> accumulo shell -f rows.out
>>>
>>> I tried playing around with scan iterators and various options on
>>> deletemany and deleterows but wasn't able to find a more straightforward
>>> way to do this. Does anybody have any suggestions?
>>>
>>> Mike
>>>
>>
>>
>
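
Keith's suggestion above (a class that extends RowFilter, set as a
compaction iterator) might look roughly like the sketch below. It is a
minimal illustration under stated assumptions, not code posted by anyone
in this thread. The class name DropRowsContainingWord and the option
names "family" and "word" are hypothetical; RowFilter and its acceptRow
method come from accumulo-core (org.apache.accumulo.core.iterators.user).

```java
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.RowFilter;

/**
 * Sketch: drops every row in which some value under a given column
 * family contains a given word. Names and options are illustrative.
 */
public class DropRowsContainingWord extends RowFilter {

  private String family; // column family to inspect (hypothetical option "family")
  private String word;   // substring that marks a row for removal (hypothetical option "word")

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    family = options.get("family");
    word = options.get("word");
    if (family == null || word == null) {
      throw new IllegalArgumentException("options 'family' and 'word' are required");
    }
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    // RowFilter.deepCopy creates a fresh instance, so carry our settings over.
    DropRowsContainingWord copy = (DropRowsContainingWord) super.deepCopy(env);
    copy.family = family;
    copy.word = word;
    return copy;
  }

  @Override
  public boolean acceptRow(SortedKeyValueIterator<Key,Value> rowIterator) throws IOException {
    // rowIterator only exposes the entries of the row being decided.
    while (rowIterator.hasTop()) {
      Key key = rowIterator.getTopKey();
      if (key.getColumnFamily().toString().equals(family)
          && rowIterator.getTopValue().toString().contains(word)) {
        return false; // reject the whole row
      }
      rowIterator.next();
    }
    return true; // keep the row
  }
}
```

To try something like this, the compiled class would need to be on the
tablet servers' classpath and then be configured on the table as a major
compaction iterator, for example through the table.iterator.majc.<name>
property (value "priority,class") together with
table.iterator.majc.<name>.opt.family and
table.iterator.majc.<name>.opt.word, followed by a full compaction such
as "compact -t tab -w" in the shell. Rows disappear only as compactions
rewrite the data, and a compaction that covers just some of a tablet's
files may see only part of a row, so a full compaction is likely the
safer way to apply it.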