Accumulo, mail # user - IndexedDocIterator, indexing approaches


Re: IndexedDocIterator, indexing approaches
Eric Newton 2013-09-16, 15:21
Hi Rob,

You're going to have to dig into the source code of the wiki example to
find out more.  It would be nice if we could update that example and
provide better documentation, but it is not maintained in its current form.

The wiki example uses JEXL to provide a base query language.  AND term
searches use intersecting iterators; other expressions are handled as basic
filters.
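
To make that split concrete, here is a minimal in-memory sketch in Python. It is
not the wiki example's code and uses no Accumulo API: `intersect_sorted`,
`query`, and the sample `index` and `docs` are all hypothetical names. It
assumes each term maps to a sorted list of unique document ids, resolves the
AND terms by merging those lists, and then applies any remaining expression as
a plain filter over the surviving documents.

```python
# In-memory model only; none of these names are Accumulo APIs.

def intersect_sorted(postings):
    """Return the ids present in every sorted, duplicate-free posting list."""
    if not postings or any(len(p) == 0 for p in postings):
        return []
    positions = [0] * len(postings)
    result = []
    # Start from the largest first element; nothing smaller can match.
    candidate = max(p[0] for p in postings)
    while True:
        agreed = True
        for i, plist in enumerate(postings):
            # Advance this list until it reaches or passes the candidate.
            while positions[i] < len(plist) and plist[positions[i]] < candidate:
                positions[i] += 1
            if positions[i] == len(plist):
                return result  # one list is exhausted, no more matches
            if plist[positions[i]] > candidate:
                # This list skipped the candidate; retry with its value.
                candidate = plist[positions[i]]
                agreed = False
                break
        if agreed:
            result.append(candidate)
            # Step past the match in the first list to find the next candidate.
            positions[0] += 1
            if positions[0] == len(postings[0]):
                return result
            candidate = postings[0][positions[0]]


def query(index, docs, and_terms, predicate):
    """Intersect the AND terms first, then filter the surviving documents."""
    postings = [index.get(term, []) for term in and_terms]
    return [d for d in intersect_sorted(postings) if predicate(docs[d])]


# Hypothetical sample data: term -> sorted doc ids, doc id -> fields.
index = {
    "accumulo": ["d1", "d2", "d4"],
    "iterator": ["d2", "d3", "d4"],
}
docs = {
    "d1": {"year": 2011},
    "d2": {"year": 2012},
    "d3": {"year": 2013},
    "d4": {"year": 2013},
}

# "accumulo AND iterator", then a greater-than test applied as a filter.
hits = query(index, docs, ["accumulo", "iterator"], lambda doc: doc["year"] > 2012)
print(hits)
```

The point of the ordering is that the intersection narrows the candidate set
cheaply from the index, so the more general filter only runs on documents that
already satisfy every AND term.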

I think you are getting Accumulo schemas, indexing, and querying. It's just
harder than you expected.

There are many teams who are working on better frameworks for query in
Accumulo; I will let them speak for themselves.

-Eric
On Sun, Sep 15, 2013 at 11:05 AM, Rob Tallis <[EMAIL PROTECTED]> wrote:

> Hello
>
> The documentation has a couple of sections for indexing - 7.3 talks about
> pulling back rowids to the client, doing your logic, then using
> BatchScanners to submit a second query. 7.5 talks about Intersecting
> Iterators and IndexedDocIterators which do all the work server/cluster side.
>
> Getting the cluster to do all the work seems like a better idea,
> particularly on massive data sets, since you might hit limits on the client
> - sounds reasonable, right?
>
> So, I've taken a look at IndexedDocIterator and IntersectingIterator and
> can get them both going with a few noddy examples of AND and NOT querying
> being done server-side - so far so good, but what about other query
> operations?
> The wiki example uses IndexedDocIterator and talks about doing OR queries,
> regex, and a "much more expressive query language" but I'm not sure how you
> do this. (I can't find the source it refers to - where do I find it?)
>
> Specifically, how would I do AND, OR and NOT queries (or union, intersect,
> except) in the same query using IndexedDocIterators or intersecting
> iterators? What about other queries like greater_than, less_than, IN,
> etc.? Are these possible?
>
> As an aside, I guess using IndexedDocIterators restricts me to having my
> document in a single row/value (perhaps encoded in JSON or something - is
> there a recommended method?). IntersectingIterator would return rowIDs
> which could refer to documents split out by ColF ColQ in the usual way -
> this would still be a secondary lookup from the client but at least the
> server has done all the hard work figuring out the rowIDs. Is this a fair
> assumption?
>
> Generally, I'm not "getting" schemas/indexing/querying in Accumulo. Is
> there a good tutorial on any of this, that perhaps shows some typical
> SQL-like things I might want to do and what is/isn't possible in Accumulo
> and how I do it?
>
> Cheers,
> Rob Tallis
>
>
>
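
For comparison, the two-step client-side approach Rob describes from section
7.3 (pull back row ids, do the logic on the client, then submit a second
query) can be modelled in a few lines of Python. This is only an in-memory
sketch under stated assumptions: `lookup`, `fetch`, `index_table`, and
`data_table` are hypothetical stand-ins for an index scan, a batched fetch by
row id, and the two tables, not Accumulo calls. It shows AND, OR, and NOT
combined in one query as set intersection, union, and difference.

```python
# In-memory model only; none of these names are Accumulo APIs.

def lookup(index_table, term):
    """Step 1: return the set of row ids indexed under a term."""
    return set(index_table.get(term, ()))


def fetch(data_table, row_ids):
    """Step 2: fetch the documents for the chosen row ids in one batch."""
    return {r: data_table[r] for r in sorted(row_ids) if r in data_table}


# Hypothetical index table: term -> row ids of documents containing it.
index_table = {
    "accumulo": ["r1", "r2", "r4"],
    "iterator": ["r2", "r3", "r4"],
    "wiki": ["r5"],
    "deprecated": ["r4"],
}
# Hypothetical data table: row id -> document.
data_table = {
    "r1": "doc one",
    "r2": "doc two",
    "r3": "doc three",
    "r4": "doc four",
    "r5": "doc five",
}

# ((accumulo AND iterator) OR wiki) NOT deprecated, evaluated on the client.
row_ids = (
    (lookup(index_table, "accumulo") & lookup(index_table, "iterator"))
    | lookup(index_table, "wiki")
) - lookup(index_table, "deprecated")

results = fetch(data_table, row_ids)
print(results)
```

The trade-off Rob raises is visible here: every intermediate row id set is held
on the client before the second lookup, which is what becomes a limit on very
large data sets.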