No, you're absolutely right, but if the webserver is OOME'ing, then it's
obviously doing something :). You could try configuring it to write out
a heapdump when it OOMEs and use jhat, jvisualvm or similar to analyze
what was actually in the heap.
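For example, something along these lines (the standard HotSpot flags; the exact launch script and dump path will vary with your setup):

```shell
# HotSpot flags to capture a heap dump when the JVM hits an OOME.
# /tmp/webserver.hprof is an illustrative path -- adjust as needed.
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/webserver.hprof"

# Then analyze the dump afterwards, e.g.:
#   jhat /tmp/webserver.hprof      # browse heap at http://localhost:7000
#   jvisualvm                      # File > Load the .hprof
```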
Let me expand a little more for you. The global index (forward and
reverse) attempts to determine the search space for the query. For
queries over very selective data, it identifies records in a row of
the doc-partitioned table using the serialized protocol buffer in the
Value. These records can be tested directly instead of having to also
"open" the index inside of the doc-partitioned table. For very broad
queries or intersections over very common terms, the global index
identifies the rows that need to be searched in the doc-partitioned table.
The index in the doc-partitioned table is where the magic happens. A
"tree" (using that term very loosely for the given implementation) is
constructed for each field and term pair in each candidate row. At this
point, merged, sorted reads over each field and term pair in that row
are scanned trying to find docids which satisfy the "tree".
If you think of the docids as integers (they're not actually integers in
practice, but that's irrelevant), each field and term pair creates a
list of docids. For every AND in the query, you're intersecting the two
lists of docids into a single sorted list, and for every OR you're
merging those two lists into a single sorted list.
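To make that concrete, here's a minimal sketch of the intersect/merge logic over sorted docid lists, using plain integers as stand-ins for the real docids (the function names are illustrative, not from the Wikisearch code):

```python
def intersect(a, b):
    """AND: keep only docids present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge(a, b):
    """OR: union of two sorted lists, deduplicated, still sorted."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

# "foo" AND "bar", with foo -> [1, 3, 5, 7] and bar -> [3, 4, 5]:
print(intersect([1, 3, 5, 7], [3, 4, 5]))  # -> [3, 5]
print(merge([1, 3], [2, 3, 9]))            # -> [1, 2, 3, 9]
```

Because both inputs stay sorted, each operator is a single linear pass, which is why the server-side iterators can stream results without materializing whole term lists.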
This is trivial when you are simply intersecting two terms (e.g. "foo"
AND "bar"), but applies generally for arbitrary subtrees, e.g. ("foo"
AND ("bar" OR "bat" OR "baz")). Treating each subtree as a sorted list
of docids is your recursive definition.
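That recursive definition can be sketched directly; this toy evaluator (names and the set-based shortcut are mine, not the Wikisearch implementation) walks a query tree where each leaf is a term and each internal node is an AND/OR over subtrees:

```python
def evaluate(node, postings):
    """Evaluate a query tree against per-term sorted docid lists.

    node: either a term string (leaf), or a tuple (op, [subtrees])
          with op in {"AND", "OR"}.
    postings: dict mapping term -> sorted list of docids.
    Returns a sorted list of docids satisfying the subtree.
    """
    if isinstance(node, str):                 # leaf: look up the term's docids
        return postings.get(node, [])
    op, children = node
    docids = set(evaluate(children[0], postings))
    for child in children[1:]:
        child_ids = set(evaluate(child, postings))
        docids = docids & child_ids if op == "AND" else docids | child_ids
    return sorted(docids)

postings = {
    "foo": [1, 2, 5, 9],
    "bar": [2, 3],
    "bat": [5],
    "baz": [7, 9],
}
# ("foo" AND ("bar" OR "bat" OR "baz"))
query = ("AND", ["foo", ("OR", ["bar", "bat", "baz"])])
print(evaluate(query, postings))  # -> [2, 5, 9]
```

Each subtree evaluates to a sorted docid list, so the same two operations compose for arbitrarily nested queries.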
On 06/10/2013 09:46 PM, Frank Smith wrote:
> Ok, thanks for these insights, as I have mentioned, I am tweaking and
> changing things for my own purpose, and I am trying to understand just
> how much my tweaking might have unintended consequences.
> To extend upon your thoughts for why there is a problem, I need to
> look in the web services to make sure it isn't creating objects from
> the results of the search scan, because it should return no results.
> That is where I am still concerned: shouldn't the scan iterator pass
> nothing through for a query with no results? Again, I need to look
> harder myself, but I am mostly trying to understand how the iterators
> notionally behave with this table structure.
> Date: Sun, 9 Jun 2013 23:18:43 -0400
> Subject: Re: Wikisearch
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> The forward and reverse index are very important, yes, with the
> in-partition "field index" being even more important.
> Yes, full table scans are undesirable and probably useless in the
> scope of the wikisearch, since it should index nearly everything and
> thus there is nothing extra to be gleaned.
> I forget exactly how it was implemented, but tokens will appear in the
> global indices and the doc partitioned table.
> The most likely reason for the OOME is that the trivial web service
> included attempts to suck all results into memory. There's nothing
> inherently wrong with scanning all records in Accumulo, but the
> webserver will easily fall over.
> On Jun 9, 2013 11:08 PM, "Frank Smith" <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> Appreciate everyone's help on the file storage question, but I was
> also looking at Josh's response to Thomas Jackson, and do I
> understand him correctly that the scan of the Index (and likely
> the ReverseIndex) table are really the key part of the search
> query, and the full table scan isn't really useful for much
> (because all of the tokens should go in the Index tables)?
> So if I understand correctly, the partitioned main table is where
> documents and tokens get written, and then a combiner feeds the