Accumulo >> mail # user >> IndexedDocIterator, indexing approaches


IndexedDocIterator, indexing approaches
Hello

The documentation has a couple of sections for indexing - 7.3 talks about
pulling back rowids to the client, doing your logic, then using
BatchScanners to submit a second query. 7.5 talks about Intersecting
Iterators and IndexedDocIterators which do all the work server/cluster side.
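
To make the 7.3-style flow concrete, here is a rough plain-Python model of the two-step lookup. It is only a sketch: the dicts stand in for an index table and a data table, and the names index_table, data_table, lookup_rowids and fetch_rows are made up for illustration, not Accumulo API.

```python
# Plain-Python model of the two-step lookup; dicts stand in for Accumulo tables.
# All names here are hypothetical and exist only for this illustration.
index_table = {
    "color=red": {"row1", "row3"},
    "size=large": {"row1", "row2"},
}
data_table = {
    "row1": {"color": "red", "size": "large"},
    "row2": {"color": "blue", "size": "large"},
    "row3": {"color": "red", "size": "small"},
}


def lookup_rowids(terms):
    """Step 1: read each term's row IDs from the index and intersect them client-side."""
    postings = [index_table.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def fetch_rows(rowids):
    """Step 2: fetch the matching rows (the role the BatchScanner plays in the docs)."""
    return {r: data_table[r] for r in sorted(rowids)}


matches = fetch_rows(lookup_rowids(["color=red", "size=large"]))
print(matches)
```

The point of the sketch is that every candidate row ID crosses the wire to the client before the second query runs, which is where the client-side limits come from.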

Getting the cluster to do all the work seems like a better idea,
particularly on massive data sets, since you might hit limits on the client
- sounds reasonable, right?

So, I've taken a look at IndexedDocIterator and IntersectingIterator and
can get both of them going with a few noddy examples of AND and NOT querying
being done server-side - so far so good, but what about other query
operations?
The wiki example uses IndexedDocIterator and talks about doing OR queries,
regex, and a "much more expressive query language" but I'm not sure how you
do this. (I can't find the source it refers to - where do I find it?)

Specifically, how would I do AND, OR and NOT queries (or union, intersect,
except) in the same query using IndexedDocIterators or
IntersectingIterators? What about other queries like greater_than,
less_than, IN, etc... are these possible?
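
To pin down what I mean by mixing operators, here is the logic I am after, written as plain Python set algebra over a toy index. This is only a model of the desired semantics, not a claim about what the iterators support; term_index, age_index and the helper names are invented for the example.

```python
from bisect import bisect_left, bisect_right

# Toy inverted index: term -> set of document IDs (hypothetical data).
term_index = {
    "apache": {"d1", "d2", "d3"},
    "hadoop": {"d2", "d3"},
    "solr": {"d4"},
    "deprecated": {"d3"},
}
# Sorted (value, doc_id) pairs for a numeric field, mimicking a sorted index.
age_index = sorted([(17, "d1"), (25, "d2"), (31, "d3"), (42, "d4")])
age_values = [value for value, _ in age_index]


def term(t):
    return term_index.get(t, set())


def any_of(terms):
    """IN / OR: union of the posting sets."""
    return set().union(*(term(t) for t in terms))


def greater_than(v):
    """Range scan: everything after the last entry with value <= v."""
    return {doc for _, doc in age_index[bisect_right(age_values, v):]}


def less_than(v):
    """Range scan: everything before the first entry with value >= v."""
    return {doc for _, doc in age_index[:bisect_left(age_values, v)]}


# apache AND (hadoop OR solr) AND NOT deprecated AND age > 20
result = (term("apache") & any_of(["hadoop", "solr"])) - term("deprecated")
result &= greater_than(20)
print(sorted(result))
```

In this model AND is intersection, OR and IN are union, NOT is set difference, and greater_than / less_than are range scans over a sorted index; the question is how much of that can be pushed to the server side.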

As an aside, I guess using IndexedDocIterators restricts me to having my
document in a single row/value (perhaps encoded in JSON or something - is
there a recommended method?). IntersectingIterator would return rowIDs
which could refer to documents split out by ColF ColQ in the usual way -
this would still be a secondary lookup from the client but at least the
server has done all the hard work figuring out the rowIDs. Is this a fair
assumption?
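
Here is how I picture the difference between the two, as a plain-Python model. It is a sketch based on my reading of the docs, with a made-up shard layout (shards, "index", "docs") and made-up function names; it is not the real table schema or iterator API.

```python
# Each shard keeps its own term index and its own documents (assumed layout).
shards = {
    "shard0": {
        "index": {"red": {"docA"}, "large": {"docA", "docB"}},
        "docs": {
            "docA": '{"color": "red", "size": "large"}',
            "docB": '{"color": "blue", "size": "large"}',
        },
    },
    "shard1": {
        "index": {"red": {"docC", "docD"}, "large": {"docD"}},
        "docs": {
            "docC": '{"color": "red", "size": "small"}',
            "docD": '{"color": "red", "size": "large"}',
        },
    },
}


def intersect_ids(terms):
    """IntersectingIterator-style result: (shard, doc ID) pairs only.

    Expects at least one term; the intersection runs inside each shard.
    """
    hits = []
    for shard_id, shard in sorted(shards.items()):
        postings = [shard["index"].get(t, set()) for t in terms]
        for doc_id in sorted(set.intersection(*postings)):
            hits.append((shard_id, doc_id))
    return hits


def intersect_docs(terms):
    """IndexedDocIterator-style result: the document value comes back too."""
    return [(s, d, shards[s]["docs"][d]) for s, d in intersect_ids(terms)]


print(intersect_ids(["red", "large"]))
print(intersect_docs(["red", "large"]))
```

With intersect_ids the client still has to go back for the document contents, while intersect_docs returns the whole document in one pass because it is stored as a single value alongside the index.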

Generally, I'm not "getting" schemas/indexing/querying in Accumulo. Is
there a good tutorial on any of this, that perhaps shows some typical
SQL-like things I might want to do and what is/isn't possible in Accumulo
and how I do it?

Cheers,
Rob Tallis