Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Accumulo, mail # user - strategies beyond intersecting iterators?


+
Sukant Hajra 2012-06-28, 20:49
+
Sukant Hajra 2012-07-02, 03:57
Copy link to this message
-
Re: strategies beyond intersecting iterators?
William Slacum 2012-07-02, 04:23
The you can think of the Intersecting (and Or) iterator as a tree of
merging keys.

So, let's assume we have the following index in a given partition. The
partition will have the row "partitionN".

partitionN Bill: 1
partitionN Bill: 2
partitionN Bill: 3
partitionN Josh: 3
partitionN Josh: 4
partitionN Josh: 5
partitionN Sukant: 0
partitionN Sukant: 3
partitionN Sukant: 6

If I wanted to query for all documents that contained "Bill", "Josh" and
"Sukant", I'd set up an IntersectingIterator that three term sources. The
term sources would be created to look at one of {"Bill", "Josh", "Sukant"}
for their column family values. The column qualifiers contained document
IDs for documents that contain the given term. This yields a set up where,
for a given term, we have a sorted list of document IDs.

To give a bit of a visualization, you can think of this structure in tree
form:
              Intersection
         /           |            \
   "Sukant"   "Bill"        "Josh"
      [0,          [1,            [3,
       3,           2,             4,
       6]           3]             5]
On our first pass, the IntersectingIterator will note that its children
point to the document IDs 0, 1 and 3. Since each list of doc IDs is sorted,
we can deduce that the earliest doc ID that could be a potential match is
3. So, it will seek the term sources for "Sukant" and "Bill" to at least
the key {row: "partitionN", colf: <term>, colq: "3"}. On the next pass,
we'll note that each term source is pointing to doc ID 3. This means we've
found an intersection, so the top level IntersectingIterator will return
docID 3.

When the session requests the next matching docID, the iterator will
advance each iterator by calling next(). The IntersectingIterator now sees
its children are all positioned at docIDs [6, null, 5] (the `null` value
arises because the "Bill" term source doesn't have a key beyond {row:
"partitionN", colf: "Bill", colq: "3"}. This state means that the
intersection is done, because one of the term sources has exhausted its
possible values, so there's no doc ID that will occur in all three lists.

On Sun, Jul 1, 2012 at 11:57 PM, Sukant Hajra <[EMAIL PROTECTED]>wrote:

> Excerpts from Sukant Hajra's message of Thu Jun 28 15:49:11 -0500 2012:
> >
> > The Accumulo documentation alludes to the problem a little:
> >
> >     If the results are unordered this is quite effective as the first
> results
> >     to arrive are as good as any others to the user.
> >
> > In our case, order matters because we want the last results without
> pulling in
> > everything.
>
> Actually, I was just thinking about this a little.  I don't know if this is
> specified in the documentation, but is there /any/ reliable (deterministic)
> ordering for the values returned by intersecting iterators?
>
> If there is, would it be horribly ill-advised to rely on this ordering for
> application logic if we got clever with our schema?
>
> Also, if someone could reply with the exact algorithm for this ordering, it
> would help put less burden on us to reverse engineer and/or read the source
> code correctly.
>
> Thanks for your help,
> Sukant
>
+
Sukant Hajra 2012-07-02, 04:43
+
Keith Turner 2012-07-02, 09:55
+
William Slacum 2012-06-28, 21:04
+
Sukant Hajra 2012-06-29, 21:27
+
William Slacum 2012-07-01, 22:18
+
Josh Elser 2012-07-01, 23:27