Accumulo, mail # user - strategies beyond intersecting iterators?


Re: strategies beyond intersecting iterators?
William Slacum 2012-06-28, 21:04
You're pretty much spot on about two aspects of the current
IntersectingIterator:

1- It's not really extensible (there are hooks for building doc IDs,
but you still need the same `partition term: docId` key structure)
2- Its main strength is that it can do the merges of sorted lists of
doc IDs based on equality expressions (e.g., `author=="bob" and
day=="20120627"`)

Fortunately, the merging logic isn't very complicated to re-create.
Personally, I think it's easy enough to separate the logic of joining
N streams of iterator results from the actual scanning. Unfortunately,
that would be left up to you to do at the moment :)
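
As a rough sketch of what the join might look like once it is separated from the scanning: the code below is plain Java with no Accumulo classes. `SortedIntersection` and `intersect` are made-up names, and each stream is assumed to yield doc IDs in ascending order without duplicates. It is not the IntersectingIterator's actual code, just the same merge idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: intersects N ascending streams of doc IDs. */
final class SortedIntersection {

    /** Returns the doc IDs that appear in every stream (assumes sorted, duplicate-free input). */
    static List<String> intersect(List<Iterator<String>> streams) {
        List<String> result = new ArrayList<>();
        int n = streams.size();
        if (n == 0) return result;

        // Load the first doc ID of every stream; an empty stream means an empty result.
        String[] heads = new String[n];
        for (int i = 0; i < n; i++) {
            if (!streams.get(i).hasNext()) return result;
            heads[i] = streams.get(i).next();
        }

        while (true) {
            // The largest current head is the smallest doc ID that could still match.
            String max = heads[0];
            for (String h : heads) {
                if (h.compareTo(max) > 0) max = h;
            }

            // Advance every lagging stream until it reaches or passes the candidate.
            boolean allEqual = true;
            for (int i = 0; i < n; i++) {
                while (heads[i].compareTo(max) < 0) {
                    if (!streams.get(i).hasNext()) return result;
                    heads[i] = streams.get(i).next();
                }
                if (heads[i].compareTo(max) > 0) allEqual = false;
            }

            if (allEqual) {
                // Every stream agrees on this doc ID: emit it and move all streams forward.
                result.add(max);
                for (int i = 0; i < n; i++) {
                    if (!streams.get(i).hasNext()) return result;
                    heads[i] = streams.get(i).next();
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Iterator<String>> streams = Arrays.asList(
                Arrays.asList("d1", "d3", "d5", "d7").iterator(),
                Arrays.asList("d3", "d4", "d7").iterator(),
                Arrays.asList("d2", "d3", "d7", "d9").iterator());
        System.out.println(intersect(streams)); // prints [d3, d7]
    }
}
```

Because the join only sees `Iterator<String>`, each stream could be backed by whatever scanning you like (one term, or a sorted range of terms) without touching the merge.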

You could do range searches by consuming every value in the range and
sorting the docIds by throwing them into a TreeSet. That would let you
emit doc IDs in globally sorted order for the given range of terms.
This can get problematic if the range ends up being very large: the
whole set has to fit in memory, and your iterator stack may
periodically be destroyed and rebuilt, which discards anything the
iterator has buffered.
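
A minimal sketch of the TreeSet idea, again in plain Java: the `NavigableMap` of term to doc IDs stands in for the index entries of one partition, and `RangeDocIds`, `docIdsInRange`, and the sample day terms are made up for illustration. A real iterator would read these entries from its source instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: collects the doc IDs for every term in a range, in sorted order. */
final class RangeDocIds {

    /** Returns the sorted, de-duplicated doc IDs of all terms in [lo, hi], both inclusive. */
    static SortedSet<String> docIdsInRange(NavigableMap<String, List<String>> index,
                                           String lo, String hi) {
        SortedSet<String> docIds = new TreeSet<>();
        // Everything in the term range is buffered before anything can be emitted.
        for (List<String> ids : index.subMap(lo, true, hi, true).values()) {
            docIds.addAll(ids);
        }
        return docIds;
    }

    public static void main(String[] args) {
        NavigableMap<String, List<String>> byDay = new TreeMap<>();
        byDay.put("20120626", Arrays.asList("d4", "d1"));
        byDay.put("20120627", Arrays.asList("d2"));
        byDay.put("20120628", Arrays.asList("d3", "d1"));
        byDay.put("20120701", Arrays.asList("d9"));
        System.out.println(docIdsInRange(byDay, "20120627", "20120628")); // prints [d1, d2, d3]
    }
}
```

The buffering step is where the cost shows up: memory grows with the number of doc IDs in the range, and the set has to be refilled if it is thrown away.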

On Thu, Jun 28, 2012 at 1:49 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> We're in a position right now, where we have a change list (like a transaction
> log) and we'd like to index the changes by author, but a typical query is:
>
>    Show the last n changes for author "Foo Bar"
>
> or
>
>    Show changes after Jan. 1st, 2012 for author "Foo Bar"
>
> Certainly, we can denormalize our data to facilitate this lookup.  But the idea
> of using intersecting iterators seems intriguing (to get a modicum of
> data-local server-side joining), but our ideas for shoe-horning the query into
> intersecting iterators seem really wonky or half-baked.  Largely, we're
> running into the restriction that intersecting iterators are based upon the
> product of boolean conjunctive statements about term equality.  What we'd
> really like is a little more range-based.  The Accumulo documentation alludes
> to the problem a little:
>
>    If the results are unordered this is quite effective as the first results
>    to arrive are as good as any others to the user.
>
> In our case, order matters because we want the last results without pulling in
> everything.
>
> We looked at the code for intersecting iterators a little, and noticed that
> there's an inheritance design, but we're not convinced that it's really
> "designed for extension" and if it is, we're not sure if it can be extended to
> meet our needs gracefully.  If it can, we're really interested in any
> suggestions or prior work.
>
> Otherwise, we're open to the idea that there are Accumulo features we're just not
> aware of beyond intersecting iterators that are a better fit.
>
> It would be wonderful to have a technique to hedge against over-denormalizing
> our data for every variant of query we have to support.
>
> Thanks for your help,
> Sukant