Re: Scan for keyword

What you're talking about is pretty common. In fact, it's so common that there should probably be an example of it in the Accumulo-examples project. Doing it requires building another table as a secondary index, as Jason mentioned. Accumulo doesn't have any special structures just for indexes; an index is just another table. Here's how you might go about it:

Assuming you use some unique identifier for your row IDs, your table might look something like this:

rowID  col fam      col qual  value
000    displayname            joey
000    login                  jd
000    name                   joe
001    displayname            jd
001    login                  joe
001    name                   joey
I would just leave the col qual blank. Then you could build a second table as an index that looks like this:

rowID  col fam      col qual  value
jd     displayname  001
jd     login        000
joe    login        001
joe    name         000
joey   displayname  000
joey   name         001
To build this table, you can simply insert the inverted Mutations into the index table at the same time you're inserting records into your first table.
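
As a rough illustration of that double write, here is a minimal sketch in plain Java. It is not Accumulo API code: each table is modeled as a TreeMap standing in for Accumulo's sorted key space, and the class name DualWriteSketch, the key helper, and the \0 separator between row, column family, and column qualifier are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

public class DualWriteSketch {
    // Each map stands in for one sorted table; keys are row \0 family \0 qualifier.
    final TreeMap<String, String> records = new TreeMap<>();
    final TreeMap<String, String> index = new TreeMap<>();

    static String key(String row, String family, String qualifier) {
        return row + '\0' + family + '\0' + qualifier;
    }

    // Write one field to the record table and its inverted entry to the index.
    void put(String rowId, String field, String value) {
        records.put(key(rowId, field, ""), value);  // qualifier left blank
        index.put(key(value, field, rowId), "");    // value becomes the index row
    }

    static DualWriteSketch sample() {
        DualWriteSketch t = new DualWriteSketch();
        t.put("000", "displayname", "joey");
        t.put("000", "login", "jd");
        t.put("000", "name", "joe");
        t.put("001", "displayname", "jd");
        t.put("001", "login", "joe");
        t.put("001", "name", "joey");
        return t;
    }

    public static void main(String[] args) {
        // Print the index entries in sorted key order, one per line.
        for (Map.Entry<String, String> e : sample().index.entrySet()) {
            System.out.println(e.getKey().replace('\0', ' '));
        }
    }
}
```

Running main lists the six index entries in the same order as the index table above, because the sorted map orders keys by row, then column family, then column qualifier.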

To query for records in which "joe" appears in any field, you simply scan the entire row identified by "joe" in the index and get all the fields in all records where "joe" appears, thus:

scanner.setRange(new Range("joe"));

To get records where "joe" appears in a specific field, say the name field, alter your scan to include a more specific range:

scanner.setRange(new Range(
    new Key(new Text("joe"), new Text("name"), new Text("")),
    new Key(new Text("joe"), new Text("name\0"), new Text(""))));
That range spans joe name to joe name\0, which covers every column qualifier in the name column family and stops before the next column family.

You can then pull out the column qualifiers from the index to get the rowIDs.
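
To make the two kinds of index scan concrete, here is a small plain-Java sketch. It models the index as a sorted TreeMap rather than calling Accumulo, and uses prefix matching as the analog of the row range and the row-plus-column-family range. The names IndexScanSketch, anyField, and inField, and the \0-separated key layout, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class IndexScanSketch {
    // Index keys are modeled as row \0 family \0 qualifier, kept in sorted order.
    static String key(String row, String family, String qualifier) {
        return row + '\0' + family + '\0' + qualifier;
    }

    static TreeMap<String, String> sampleIndex() {
        TreeMap<String, String> index = new TreeMap<>();
        index.put(key("jd", "displayname", "001"), "");
        index.put(key("jd", "login", "000"), "");
        index.put(key("joe", "login", "001"), "");
        index.put(key("joe", "name", "000"), "");
        index.put(key("joey", "displayname", "000"), "");
        index.put(key("joey", "name", "001"), "");
        return index;
    }

    // Return the qualifier (the record's rowID) of every key starting with prefix.
    static List<String> rowIdsWithPrefix(TreeMap<String, String> index, String prefix) {
        List<String> rowIds = new ArrayList<>();
        // Keys sharing a prefix are contiguous, so start at the prefix and stop
        // at the first key that no longer matches.
        for (String k : index.tailMap(prefix, true).keySet()) {
            if (!k.startsWith(prefix)) break;
            rowIds.add(k.substring(k.lastIndexOf('\0') + 1));
        }
        return rowIds;
    }

    // Any field: every entry in the index row for the term.
    static List<String> anyField(TreeMap<String, String> index, String term) {
        return rowIdsWithPrefix(index, term + '\0');
    }

    // One field: only entries in the given column family of that row.
    static List<String> inField(TreeMap<String, String> index, String term, String field) {
        return rowIdsWithPrefix(index, term + '\0' + field + '\0');
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = sampleIndex();
        System.out.println(anyField(index, "joe"));         // [001, 000]
        System.out.println(inField(index, "joe", "name"));  // [000]
    }
}
```

Note that "joey" is not picked up by the "joe" scan: the separator after the row keeps the match to the exact term, just as a Range on the row "joe" does not include the row "joey".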

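If you want to look up values from each of those rows, you can collect the rowIDs into a set of Ranges and pass them to a BatchScanner. The following is adapted from the code in the Indexing subsection of the Table Design section of the manual, changed to read the rowIDs from the column qualifier as in the index table above. It assumes conn is a Connector and auths is an Authorizations instance:

import java.util.HashSet;
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// first we scan the index row for our search term
Text term = new Text("mySearchTerm");

HashSet<Range> matchingRows = new HashSet<Range>();

Scanner indexScanner = conn.createScanner("index", auths);
indexScanner.setRange(new Range(term, term));

// we retrieve the matching rowIDs and create a set of ranges, one per row
for (Entry<Key,Value> entry : indexScanner)
    matchingRows.add(new Range(entry.getKey().getColumnQualifier()));

// now we pass the set of ranges to the batch scanner to retrieve the rows
// (setRanges rejects an empty collection, so check matchingRows in real code)
BatchScanner bscan = conn.createBatchScanner("table", auths, 10);
bscan.setRanges(matchingRows);

for (Entry<Key,Value> entry : bscan)
    System.out.println(entry.getKey() + " -> " + entry.getValue());

bscan.close();
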
This whole process is more complicated than I'd like it to be, but it works pretty well and people have built huge tables and indexes this way. You can get very fancy with what and how you choose to index.
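
For anyone who wants to see the whole two-step lookup (index scan, then fetch of the matching records) without a running cluster, here is a minimal plain-Java model of it. It is a sketch under stated assumptions, not Accumulo code: both tables are TreeMaps, keys are row \0 family \0 qualifier, and the names TwoStepLookupSketch, row, and search are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoStepLookupSketch {
    // Both maps stand in for sorted tables; keys are row \0 family \0 qualifier.
    final TreeMap<String, String> records = new TreeMap<>();
    final TreeMap<String, String> index = new TreeMap<>();

    // Write a field to the record table and its inverted entry to the index.
    void put(String rowId, String field, String value) {
        records.put(rowId + '\0' + field + '\0', value);
        index.put(value + '\0' + field + '\0' + rowId, "");
    }

    // All entries of one row in key order (the analog of scanning a single row).
    static Map<String, String> row(TreeMap<String, String> table, String row) {
        String prefix = row + '\0';
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    // Step 1: read rowIDs from the index. Step 2: fetch each matching record.
    Map<String, Map<String, String>> search(String term) {
        TreeSet<String> rowIds = new TreeSet<>();  // a set, so each record is fetched once
        for (String k : row(index, term).keySet()) {
            rowIds.add(k.substring(k.lastIndexOf('\0') + 1));
        }
        Map<String, Map<String, String>> hits = new LinkedHashMap<>();
        for (String rowId : rowIds) {
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : row(records, rowId).entrySet()) {
                String k = e.getKey();
                int start = k.indexOf('\0') + 1;  // column family starts after the row
                fields.put(k.substring(start, k.indexOf('\0', start)), e.getValue());
            }
            hits.put(rowId, fields);
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoStepLookupSketch t = new TwoStepLookupSketch();
        t.put("000", "displayname", "joey");
        t.put("000", "login", "jd");
        t.put("000", "name", "joe");
        t.put("001", "displayname", "jd");
        t.put("001", "login", "joe");
        t.put("001", "name", "joey");
        // Both records come back: "joe" is the name in 000 and the login in 001.
        System.out.println(t.search("joe"));
    }
}
```

Collecting the rowIDs into a set before the second step matters when a term appears in more than one field of the same record; otherwise that record would be fetched twice.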

Let us know how this goes for you.

On Nov 23, 2011, at 2:35 PM, Joey Daughtery wrote:

> Aaron
> Thanks for the reply.  I was only able to get data into Accumulo after reviewing the page you provided.
> Let's say, for example, that I am storing Name, login, and DisplayName as column families, and I have inserted Joe, jd, joey as one record and joey, joe, jd as the second record.
> mut.put(new Text("Name"), new Text("joe"), cv, new Value("joe"));
> mut.put(new Text("login"), new Text("jd"), cv, new Value("jd"));
> mut.put(new Text("DisplayName"), new Text("joey"), cv, new Value("joey"));
> write(...)
> mut.put(new Text("Name"), new Text("joey"), cv, new Value("joey"));
> mut.put(new Text("login"), new Text("joe"), cv, new Value("joe"));
> mut.put(new Text("DisplayName"), new Text("jd"), cv, new Value("jd"));
> write(...)
> How would I execute a keyword search for "joe" in an attempt to pull back both records where Joe is the value for Login for one record while "joe" is a value for Name in another?
> The example in the Table Design page shows the search based on the row id.  From my understanding if I provide the rowId, it will limit the search to that row.  But the example on that page is essentially just loading a specific row based on a rowid, not a keyword search.