-
Deleting rows from the Java API
Sean Pines 2012-05-09, 14:31
I have a use case that involves me removing a record from Accumulo based on the Row ID and the Column Family.

In the shell, I noticed the command "deletemany" which allows you to specify column family/column qualifier. Is there an equivalent of this in the Java API?

In the Java API, I noticed the method:
deleteRows(String tableName, org.apache.hadoop.io.Text start, org.apache.hadoop.io.Text end)
Delete rows between (start, end]
<http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#deleteRows%28java.lang.String,%20org.apache.hadoop.io.Text,%20org.apache.hadoop.io.Text%29>

However that only seems to work for deleting a range of RowIDs.

I would also imagine that deleting rows is costly; is there a better way to approach something like this?
The workaround I have for now is to just overwrite the row with an empty string in the value field and ignore any entries that have that. However this just leaves lingering rows for each "delete" and I'd like to avoid that if at all possible.

Thanks!
-
Re: Deleting rows from the Java API
Billie J Rinaldi 2012-05-09, 15:00
Connector provides a createBatchDeleter method. You can set the range and columns for BatchDeleter just like you would with a Scanner. This is not an efficient operation (despite the current javadocs for BatchDeleter), but it works well if you're deleting a small number of entries. It scans for the affected key/value pairs, pulls them back to the client, then inserts deletion entries for each. The deleteRows method, on the other hand, is efficient because large ranges can just be dropped. If you want to delete a lot of things and deleteRows won't work for you, consider using a majc scope Filter that filters out what you don't want, compact the table, then remove the filter.
Billie
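To make the cost difference concrete, here is a toy model in plain Java. It does not use the Accumulo client API: the class DeleteModel and its methods are invented for illustration, and a table is simply modeled as a sorted map. It contrasts deleting by scanning for matching entries (the approach described above for BatchDeleter) with dropping a whole row range (the approach described for deleteRows). In real client code the corresponding entry points would be Connector.createBatchDeleter and TableOperations.deleteRows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a "table" is a sorted map from "row<SEP>family<SEP>qualifier" to value.
final class DeleteModel {
    // Separator between row, family and qualifier; assumed never to occur in the data.
    static final char SEP = '\u0000';

    static String key(String row, String fam, String qual) {
        return row + SEP + fam + SEP + qual;
    }

    // Same eight entries as the shell session later in this thread.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> t = new TreeMap<>();
        t.put(key("alpha", "fam1", "qual1"), "val1");
        t.put(key("alpha", "fam2", "qual2"), "val2");
        t.put(key("beta", "fam1", "qual1"), "val1");
        t.put(key("beta", "fam2", "qual2a"), "val2");
        t.put(key("beta", "fam2", "qual2b"), "val2");
        t.put(key("beta", "fam3", "qual3"), "val3");
        t.put(key("gamma", "fam2", "qual2"), "val2");
        t.put(key("gamma", "fam3", "qual3"), "val3");
        return t;
    }

    // Scan-then-delete: find every matching entry, then issue one delete per entry.
    // The work grows with the number of matching entries. (The model removes the
    // entries directly; it does not model deletion markers.)
    static int deleteByScan(TreeMap<String, String> table, String row, String fam) {
        String prefix = row + SEP + fam + SEP;
        List<String> matches = new ArrayList<>();
        for (String k : table.keySet()) {
            if (k.startsWith(prefix)) {
                matches.add(k);
            }
        }
        for (String k : matches) {
            table.remove(k);
        }
        return matches.size();
    }

    // Range drop: remove all rows in (startRow, endRow] without looking at
    // individual entries. Requires startRow <= endRow.
    static void dropRowRange(TreeMap<String, String> table, String startRow, String endRow) {
        String from = startRow + '\u0001'; // first possible key after all of startRow's entries
        String to = endRow + '\u0001';     // first possible key after all of endRow's entries
        table.subMap(from, true, to, false).clear();
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        int deleted = deleteByScan(table, "beta", "fam2");
        System.out.println("deleted by scan: " + deleted + ", remaining: " + table.size());
        dropRowRange(table, "alpha", "beta");
        System.out.println("remaining after range drop: " + table.size());
    }
}
```

The scan-based method can target a single column family within a row, which is what the original question asks for, while the range drop can only remove whole rows.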
-
Re: Deleting rows from the Java API
Keith Turner 2012-05-09, 15:13
On Wed, May 9, 2012 at 11:00 AM, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> If you want to delete a lot of things and deleteRows won't work for you, consider using a majc scope Filter that filters out what you don't want, compact the table, then remove the filter.
If you use the filter option, you would probably want to put the filter at all scopes, flush, compact, and then remove the filter. Having the filter at the scan scope prevents users from seeing any of the data immediately. If the filter is only at the majc scope, then users will still see the data in some parts of the table while the compaction is running. Having the filter at the minc scope will filter out any data in memory when you flush. Having the filter at the majc scope will filter existing data on disk when you compact.
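The effect of each scope can be illustrated with a toy model in plain Java. This is only a sketch: ScopeModel, its fields and its methods are invented here and do not mirror Accumulo internals. It models a tablet as one in-memory map plus one on-disk map, with a single filter that is active only in the scopes it was attached to.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Toy model only: shows where a filter takes effect depending on its scopes.
final class ScopeModel {
    enum Scope { SCAN, MINC, MAJC }

    final TreeMap<String, String> memory = new TreeMap<>(); // recent writes, not yet flushed
    final TreeMap<String, String> disk = new TreeMap<>();   // data already written to files

    private Predicate<String> filter = k -> true;            // returns true for keys to keep
    private EnumSet<Scope> scopes = EnumSet.noneOf(Scope.class);

    void attach(Predicate<String> f, EnumSet<Scope> s) {
        filter = f;
        scopes = EnumSet.copyOf(s);
    }

    void detach() {
        filter = k -> true;
        scopes = EnumSet.noneOf(Scope.class);
    }

    private boolean keep(Scope scope, String key) {
        return !scopes.contains(scope) || filter.test(key);
    }

    // Minor compaction: write in-memory data to disk, applying the minc filter.
    void flush() {
        for (Map.Entry<String, String> e : memory.entrySet()) {
            if (keep(Scope.MINC, e.getKey())) {
                disk.put(e.getKey(), e.getValue());
            }
        }
        memory.clear();
    }

    // Major compaction: rewrite on-disk data, applying the majc filter.
    void compact() {
        disk.keySet().removeIf(k -> !keep(Scope.MAJC, k));
    }

    // Scan: merge disk and memory, applying the scan filter to what the user sees.
    TreeMap<String, String> scan() {
        TreeMap<String, String> view = new TreeMap<>(disk);
        view.putAll(memory);
        view.keySet().removeIf(k -> !keep(Scope.SCAN, k));
        return view;
    }

    public static void main(String[] args) {
        ScopeModel t = new ScopeModel();
        t.disk.put("a:keep", "1");
        t.disk.put("b:drop", "2");
        t.memory.put("c:drop", "3");
        t.attach(k -> !k.endsWith("drop"), EnumSet.allOf(Scope.class));
        System.out.println("visible while filter is attached: " + t.scan().keySet());
        t.flush();
        t.compact();
        t.detach();
        System.out.println("visible after flush, compact, detach: " + t.scan().keySet());
    }
}
```

In this model, attaching the filter at the majc scope alone and compacting without a flush leaves the unflushed entry in place, and scans keep returning unwanted entries until the compaction has rewritten them. That is the situation the all-scopes recipe avoids.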
-
Re: Deleting rows from the Java API
David Medinets 2012-05-09, 17:53
On 5/9/12, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> If you want to delete a lot of things and deleteRows won't work for you, consider using a majc scope Filter that filters out what you don't want, compact the table, then remove the filter.
Is there an example that already does this? Would you consider writing one? Providing simple working java code is so very helpful.
-
Re: Deleting rows from the Java API
Adam Fuchs 2012-05-09, 18:43
I would also add that "small number of entries" in this case is probably measured in the millions or tens of millions. If you're talking about deleting more entries than that then you might start to look into the iterator method.
Cheers,
Adam
-
Re: Deleting rows from the Java API
Keith Turner 2012-05-09, 19:39
On Wed, May 9, 2012 at 2:43 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
> I would also add that "small number of entries" in this case is probably measured in the millions or tens of millions. If you're talking about deleting more entries than that then you might start to look into the iterator method.
Just to clarify, a filter is a type of iterator.
-
Re: Deleting rows from the Java API
Billie J Rinaldi 2012-05-10, 14:13
On Wednesday, May 9, 2012 1:53:23 PM, "David Medinets" <[EMAIL PROTECTED]> wrote:
> On 5/9/12, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> > If you want to delete a lot of things and deleteRows won't work for you, consider using a majc scope Filter that filters out what you don't want, compact the table, then remove the filter.
>
> Is there an example that already does this? Would you consider writing one? Providing simple working java code is so very helpful.
Consider the following Java code:
package test;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.hadoop.io.Text;

public class RangeColumnRemovalFilter extends Filter {
  private static final Range rangeToRemove = new Range("begin", "end");
  private static final Text colfToRemove = new Text("fam2");

  @Override
  public boolean accept(Key k, Value v) {
    return !(rangeToRemove.contains(k) && k.getColumnFamily().equals(colfToRemove));
  }
}
Of course, if you wanted to make this more configurable you could pass in the range and column family as parameters. Look at the init method of Filter to see how it receives a parameter and the setNegate static method to see how parameters should be set on IteratorSetting objects.
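As a rough sketch of the parameterized version, the plain-Java class below shows only the option-handling and accept logic. It deliberately avoids the Accumulo classes so that it compiles on its own; in a real Filter subclass this logic would live in init, which receives the options as a Map<String,String>, and in accept. The class name RangeColumnOptions and the option names startRow, endRow, and colf are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: models the parameter handling of a configurable removal filter
// using plain strings instead of Accumulo's Key, Range and Text.
final class RangeColumnOptions {
    // Hypothetical option names, chosen for this example.
    static final String START_ROW = "startRow";
    static final String END_ROW = "endRow";
    static final String COLF = "colf";

    private String startRow;
    private String endRow;
    private String colf;

    // Reads and validates the options, as a Filter subclass might do in init.
    void init(Map<String, String> options) {
        startRow = require(options, START_ROW);
        endRow = require(options, END_ROW);
        colf = require(options, COLF);
        if (startRow.compareTo(endRow) > 0) {
            throw new IllegalArgumentException(START_ROW + " must not sort after " + END_ROW);
        }
    }

    private static String require(Map<String, String> options, String name) {
        String value = options.get(name);
        if (value == null) {
            throw new IllegalArgumentException("missing option: " + name);
        }
        return value;
    }

    // Returns true to keep an entry, false to filter it out. Rows are compared as
    // Java strings here, which matches byte order only for plain ASCII row IDs.
    boolean accept(String row, String family) {
        boolean inRange = row.compareTo(startRow) >= 0 && row.compareTo(endRow) <= 0;
        return !(inRange && family.equals(colf));
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(START_ROW, "begin");
        options.put(END_ROW, "end");
        options.put(COLF, "fam2");

        RangeColumnOptions filter = new RangeColumnOptions();
        filter.init(options);
        System.out.println("keep beta/fam2?  " + filter.accept("beta", "fam2"));
        System.out.println("keep alpha/fam2? " + filter.accept("alpha", "fam2"));
    }
}
```

With the options set to begin, end, and fam2, this reproduces the behavior of the hard-coded filter: only entries in column family fam2 whose row sorts between "begin" and "end" are rejected.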
Jar up the RangeColumnRemovalFilter and drop it in the lib/ext directory. Open the accumulo shell and type the following commands.
root@instanceName> createtable testtable
root@instanceName testtable> insert alpha fam1 qual1 val1
root@instanceName testtable> insert alpha fam2 qual2 val2
root@instanceName testtable> insert beta fam1 qual1 val1
root@instanceName testtable> insert beta fam2 qual2a val2
root@instanceName testtable> insert beta fam2 qual2b val2
root@instanceName testtable> insert beta fam3 qual3 val3
root@instanceName testtable> insert gamma fam2 qual2 val2
root@instanceName testtable> insert gamma fam3 qual3 val3
root@instanceName testtable> scan
alpha fam1:qual1 [] val1
alpha fam2:qual2 [] val2
beta fam1:qual1 [] val1
beta fam2:qual2a [] val2
beta fam2:qual2b [] val2
beta fam3:qual3 [] val3
gamma fam2:qual2 [] val2
gamma fam3:qual3 [] val3
root@instanceName testtable> setiter -t testtable -scan -majc -minc -p 1 -n rcRemoval -class test.RangeColumnRemovalFilter
Filter accepts or rejects each Key/Value pair
----------> set RangeColumnRemovalFilter parameter negate, default false keeps k/v that pass accept method, true rejects k/v that pass accept method:
root@instanceName testtable> compact -t testtable -b begin -e end -w
10 10:07:40,148 [shell.Shell] INFO : Compacting table ...
10 10:07:40,903 [shell.Shell] INFO : Compaction of table testtable completed for given range
root@instanceName testtable> deleteiter -t testtable -scan -majc -minc -n rcRemoval
root@instanceName testtable> scan
alpha fam1:qual1 [] val1
alpha fam2:qual2 [] val2
beta fam1:qual1 [] val1
beta fam3:qual3 [] val3
gamma fam2:qual2 [] val2
gamma fam3:qual3 [] val3
root@instanceName testtable> deletetable testtable
Table: [testtable] has been deleted.
root@instanceName>
The following code shows how to apply the RangeColumnRemovalFilter programmatically. If you jar it up with the filter and drop it in lib/ext, you just have to type "accumulo test.RangeColumnRemovalFilterTest" to run it. You will need to either change the instance name, zookeeper host, username, and password, or change the code to pull them from the command line.
package test;

import java.util.EnumSet;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class RangeColumnRemovalFilterTest {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instanceName", "zookeeperHost").getConnector("user", "pass");
    conn.tableOperations().create("tableName");

    // write the same test data as in the shell session
    BatchWriter bw = conn.createBatchWriter("tableName", 200000L, 1000, 1);
    Mutation m = new Mutation("alpha"); // before "begin"
    m.put("fam1", "qual1", "val1");
    m.put("fam2", "qual2", "val2");
    bw.addMutation(m);
    m = new Mutation("beta"); // between "begin" and "end"
    m.put("fam1", "qual1", "val1");
    m.put("fam2", "qual2a", "val2");
    m.put("fam2", "qual2b", "val2");
    m.put("fam3", "qual3", "val3");
    bw.addMutation(m);
    m = new Mutation("gamma"); // after "end"
    m.put("fam2", "qual2", "val2");
    m.put("fam3", "qual3", "val3");
    bw.addMutation(m);
    bw.close();

    System.out.println("Before:");
    for (Entry<Key,Value> entry : conn.createScanner("tableName", new Authorizations())) {
      System.out.println(entry);
    }

    // attach the filter, flush and compact the affected range (waiting for completion), then remove the filter
    IteratorSetting is = new IteratorSetting(1, "rcRemoval", RangeColumnRemovalFilter.class);
    conn.tableOperations().attachIterator("tableName", is);
    conn.tableOperations().compact("tableName", new Text("begin"), new Text("end"), true, true);
    conn.tableOperations().removeIterator("tableName", "rcRemoval", EnumSet.allOf(IteratorScope.class));

    System.out.println("\nAfter:");
    for (Entry<Key,Value> entry : conn.createScanner("tableName", new Authorizations())) {
      System.out.println(entry);
    }

    conn.tableOperations().delete("tableName"); // remove the table so we can run the test again
  }
}