Accumulo, mail # user - IndexedDocIterator, indexing approaches

Re: IndexedDocIterator, indexing approaches
Josh Elser 2013-09-17, 14:57
Rob,

I can try to provide a little more insight here.

If you think about the intersecting iterator(s) as (set) intersecting two
sorted streams of unique IDs, you can easily work in negations and
disjunctions as well. A union iterator is easy to make as you just merge
the two sorted streams of unique IDs. A negation is just an existence check
in a sorted stream. Being able to represent each of these operands in terms
of sorted streams of unique IDs, you can create arbitrary trees of them,
e.g. (A and (B or C)).
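
To make the "sorted streams" idea concrete, here is a minimal sketch in plain Java. It does not use Accumulo's iterator API; the class and method names (SortedIdStreams, intersect, union, except) are made up for illustration, and each operand is assumed to be an in-memory sorted list of unique document IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Accumulo. Every input list is assumed
// to be sorted ascending and free of duplicates.
final class SortedIdStreams {

    // AND: advance whichever side is behind; emit IDs present in both.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c == 0) { out.add(a.get(i)); i++; j++; }
            else if (c < 0) { i++; }
            else { j++; }
        }
        return out;
    }

    // OR: merge the two streams, emitting an ID shared by both only once.
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j >= b.size()) { out.add(a.get(i++)); }
            else if (i >= a.size()) { out.add(b.get(j++)); }
            else {
                int c = a.get(i).compareTo(b.get(j));
                if (c == 0) { out.add(a.get(i)); i++; j++; }
                else if (c < 0) { out.add(a.get(i++)); }
                else { out.add(b.get(j++)); }
            }
        }
        return out;
    }

    // AND NOT: keep IDs from a that fail an existence check in b.
    static List<String> except(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int j = 0;
        for (String id : a) {
            while (j < b.size() && b.get(j).compareTo(id) < 0) { j++; }
            if (j >= b.size() || !b.get(j).equals(id)) { out.add(id); }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("d1", "d2", "d4", "d7");
        List<String> b = Arrays.asList("d2", "d3");
        List<String> c = Arrays.asList("d4", "d9");
        // (A and (B or C))
        System.out.println(intersect(a, union(b, c))); // prints [d2, d4]
    }
}
```

Because every operation takes sorted input and returns sorted output, the results can be nested to build an arbitrary tree of operands.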

As for greater than, less than, and regular expressions, the easiest
approach is to use an inverted index to expand these operators into
discrete terms based on what actually occurs in the data. However, this is
not without its own pitfalls as well :)
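
A rough sketch of that expansion, again in plain Java rather than against a real Accumulo table. The TermExpansion class and its methods are hypothetical, the inverted index is modeled as an in-memory sorted map from term to document IDs, and numeric values are assumed to be zero-padded so that string order matches numeric order.

```java
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical in-memory stand-in for an inverted index: term -> sorted doc IDs.
final class TermExpansion {
    private final NavigableMap<String, SortedSet<String>> index = new TreeMap<>();

    void add(String term, String docId) {
        index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // "greater than": every indexed term strictly above the bound.
    SortedSet<String> termsGreaterThan(String bound) {
        return new TreeSet<>(index.tailMap(bound, false).keySet());
    }

    // Regular expression: every indexed term that matches in full.
    SortedSet<String> termsMatching(String regex) {
        Pattern p = Pattern.compile(regex);
        SortedSet<String> out = new TreeSet<>();
        for (String term : index.keySet()) {
            if (p.matcher(term).matches()) { out.add(term); }
        }
        return out;
    }

    // The expanded terms are OR-ed together: union of their doc ID sets.
    SortedSet<String> docsForAny(Iterable<String> terms) {
        SortedSet<String> out = new TreeSet<>();
        for (String term : terms) {
            SortedSet<String> ids = index.get(term);
            if (ids != null) { out.addAll(ids); }
        }
        return out;
    }

    public static void main(String[] args) {
        TermExpansion idx = new TermExpansion();
        idx.add("025", "d1");
        idx.add("031", "d2");
        idx.add("040", "d3");
        idx.add("040", "d4");
        SortedSet<String> terms = idx.termsGreaterThan("030");
        System.out.println(terms);                 // prints [031, 040]
        System.out.println(idx.docsForAny(terms)); // prints [d2, d3, d4]
    }
}
```

One pitfall this makes visible: the number of expanded terms depends on the data, so a broad range or pattern can turn into a very large disjunction.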

The point about "more expressive query language" is typically supported
through post-filtering, e.g. (A and B and
my-really-complicated-function()). In this case, you can primarily run your
query over the intersection of A and B, and then post-filter out records
which don't satisfy the query. However, this is only really ideal when your
primary search terms are identifying a "reasonable" subset of your data.
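
As an illustration of post-filtering, here is a small plain-Java sketch. The PostFilter class, the in-memory document map, and the word-count predicate standing in for my-really-complicated-function() are all assumptions made for the example; none of this is Accumulo API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical post-filter: the index has already narrowed the query to a
// candidate list (e.g. A and B); the expensive check runs only on those.
final class PostFilter {

    static List<String> filter(List<String> candidateIds,
                               Map<String, String> docs,
                               Predicate<String> expensiveCheck) {
        List<String> out = new ArrayList<>();
        for (String id : candidateIds) {
            String doc = docs.get(id);
            // Skip IDs with no stored document, then apply the check.
            if (doc != null && expensiveCheck.test(doc)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("d2", "alpha beta");
        docs.put("d4", "alpha beta gamma delta");

        // Pretend the intersection of A and B produced these candidates.
        List<String> candidates = Arrays.asList("d2", "d4");

        // Stand-in for an arbitrary function: at least three words.
        Predicate<String> check = doc -> doc.split(" ").length >= 3;

        System.out.println(filter(candidates, docs, check)); // prints [d4]
    }
}
```

The cost is proportional to the number of candidates, which is why this only works well when A and B already select a small part of the data.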

As Eric points out, implementing "full" SQL semantics, in addition to
generalized secondary indexing, is a rather difficult problem; however,
Accumulo does make some things very easy to work with.

On Mon, Sep 16, 2013 at 11:21 AM, Eric Newton <[EMAIL PROTECTED]> wrote:

> Hi Rob,
>
> You're going to have to dig into the source code of the wiki example to
> find out more.  It would be nice if we could update that example and
> provide better documentation, but it is not maintained in its current form.
>
> The wiki example uses JEXL to provide a base query language.  AND term
> searches use intersecting iterators; other expressions are handled as basic
> filters.
>
> I think you are getting accumulo schemas, indexing and querying. It's just
> harder than you expected.
>
> There are many teams who are working on better frameworks for query in
> Accumulo; I will let them speak for themselves.
>
> -Eric
>
>
> On Sun, Sep 15, 2013 at 11:05 AM, Rob Tallis <[EMAIL PROTECTED]> wrote:
>
>> Hello
>>
>> The documentation has a couple of sections for indexing - 7.3 talks about
>> pulling back rowids to the client, doing your logic, then using
>> BatchScanners to submit a second query. 7.5 talks about Intersecting
>> Iterators and IndexedDocIterators which do all the work server/cluster side.
>>
>> Getting the cluster to do all the work seems like a better idea,
>> particularly on massive data sets, since you might hit limits on the client
>> - sounds reasonable, right?
>>
>> So, I've taken a look at IndexedDocIterator and IntersectingIterator and
>> can get them both going with a few noddy examples of AND and NOT querying
>> being done server-side - so far so good, but what about other query
>> operations?
>> The wiki example uses IndexedDocIterator and talks about doing OR
>> queries, regex, and a "much more expressive query language" but I'm not
>> sure how you do this. (I can't find the source it refers to - where do I
>> find it?)
>>
>> Specifically, how would I do AND, OR and NOT queries (or union,
>> intersect, except) in the same query using IndexedDocIterators or
>> intersecting iterators. What about other queries like greater_than,
>> less_than, IN, etc... are these possible?
>>
>> As an aside, I guess using IndexedDocIterators restricts me to having my
>> document in a single row/value (perhaps encoded in JSON or something - is
>> there a recommended method?). IntersectingIterator would return rowIDs
>> which could refer to documents split out by ColF ColQ in the usual way -
>> this would still be a secondary lookup from the client but at least the
>> server has done all the hard work figuring out the rowIDs. Is this a fair
>> assumption?
>>
>> Generally, I'm not "getting" schemas/indexing/querying in Accumulo. Is
>> there a good tutorial on any of this, that perhaps shows some typical