
Re: strategies beyond intersecting iterators?
I had started a response, but Bill beat me to it, so let me reiterate.

The tear-down is mostly there to keep scans responsive when multiple
scans are happening at one time. There's a buffer between the
TabletServer(s) and the client; (if memory serves) once it's filled, the
scan session is a candidate to be torn down and later recreated.

To avoid duplicate work by your Accumulo iterators, the last key the
iterators returned is maintained by Accumulo.

For example, if you started a scan with a Range:

(-inf, +inf)

Say you scanned 2000/10000 keys in a table of monotonically increasing
Keys where only the row is populated. The buffer was filled, the
iterators torn down, and re-created some amount of time later. Instead
of getting the (-inf, +inf) range again, you would then get the range:

(2000, +inf)

Meaning, the initial infinite start key would be replaced with a start
key which was the last key your previous scan returned, non-inclusive.
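
To make the range rewrite concrete, here is a minimal sketch in plain
Java. It is a toy model and does not use the Accumulo API: the names
`ScanResumeSketch`, `OpenRange` and `scanAll` are made up, the table is
just a `TreeSet` of integers, and I'm assuming the buffer fills after a
fixed number of returned keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ScanResumeSketch {

    /** Hypothetical stand-in for a scan range (startExclusive, +inf); null means -inf. */
    static final class OpenRange {
        final Integer startExclusive;

        OpenRange(Integer startExclusive) {
            this.startExclusive = startExclusive;
        }

        @Override
        public String toString() {
            return "(" + (startExclusive == null ? "-inf" : startExclusive) + ", +inf)";
        }
    }

    /**
     * Scans the whole table, "tearing down" after every bufferSize returned keys.
     * Returns the range each (re)created session was given; scanned keys go into out.
     */
    static List<String> scanAll(TreeSet<Integer> table, int bufferSize, List<Integer> out) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<String> seekedRanges = new ArrayList<>();
        OpenRange range = new OpenRange(null);
        while (true) {
            seekedRanges.add(range.toString());
            // A fresh session: the only thing carried over is the range itself.
            NavigableSet<Integer> view = range.startExclusive == null
                    ? table
                    : table.tailSet(range.startExclusive, false);
            int returned = 0;
            Integer last = null;
            for (Integer key : view) {
                if (returned == bufferSize) {
                    break; // buffer is full, session becomes a tear-down candidate
                }
                out.add(key);
                last = key;
                returned++;
            }
            if (returned < bufferSize) {
                return seekedRanges; // ran out of keys before the buffer filled
            }
            // Resume strictly after the last key that was returned.
            range = new OpenRange(last);
        }
    }

    public static void main(String[] args) {
        TreeSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= 10000; i++) {
            table.add(i);
        }
        List<Integer> out = new ArrayList<>();
        System.out.println(scanAll(table, 2000, out));
        System.out.println(out.size() + " keys scanned");
    }
}
```

With 10000 keys and a buffer of 2000, the second session in this model
is handed (2000, +inf), matching the example above, and no key is
returned twice.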

In short, it's good practice to try to keep Accumulo iterators from
holding on to state in memory, otherwise you may get stuck creating the
same in-memory members on your iterators repeatedly. See ACCUMULO-625
for some thoughts about trying to avoid this lost-state issue.
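
As a rough illustration of the "creating the same in-memory members
repeatedly" cost, here is another toy model in plain Java (again, not
the Accumulo API; `RebuildCostSketch` and `sourceReads` are hypothetical
names). It assumes a tear-down after every `bufferSize` returned keys
and compares an iterator that copies its whole range into memory on each
re-creation with one that just streams.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class RebuildCostSketch {

    /**
     * Counts how many keys are read from the source over a full scan that is
     * torn down and re-created after every bufferSize returned keys.
     */
    static long sourceReads(NavigableSet<Integer> table, int bufferSize, boolean bufferWholeRange) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        long reads = 0;
        Integer lastReturned = null; // null stands in for a -inf start key
        while (true) {
            // Re-created session: seek'd to just after the last returned key.
            NavigableSet<Integer> remaining = lastReturned == null
                    ? table
                    : table.tailSet(lastReturned, false);
            if (remaining.isEmpty()) {
                return reads;
            }
            if (bufferWholeRange) {
                // Stateful style: rebuild the in-memory set from scratch every time.
                TreeSet<Integer> inMemory = new TreeSet<>(remaining);
                reads += inMemory.size();
            }
            int returned = 0;
            for (Integer key : remaining) {
                if (returned == bufferSize) {
                    break; // buffer full, session is torn down
                }
                if (!bufferWholeRange) {
                    reads++; // streaming style reads only what it returns
                }
                lastReturned = key;
                returned++;
            }
        }
    }

    public static void main(String[] args) {
        TreeSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= 10000; i++) {
            table.add(i);
        }
        System.out.println("buffering reads: " + sourceReads(table, 2000, true));
        System.out.println("streaming reads: " + sourceReads(table, 2000, false));
    }
}
```

In this model the buffering variant re-reads everything left in the
range on each re-creation, so its read count grows with the number of
tear-downs, while the streaming variant reads each key once.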

- Josh

On 07/01/2012 05:18 PM, William Slacum wrote:
> By iterator stack I am referring to the Accumulo iterators. Resource
> sharing among scan sessions is implemented by destroying a user scan
> session and eventually recreating the iterator stack. The new stack is
> then seek'd to the last key returned by the entire stack. If you were
> holding some state, such as a set of keys, it would be rebuilt every
> time the stack is created.
> On Jul 1, 2012 5:55 PM, "Sukant Hajra" <[EMAIL PROTECTED]> wrote:
>     Excerpts from William Slacum's message of Thu Jun 28 16:04:32 -0500 2012:
>     >
>     > You're pretty much on the spot regarding two aspects about the
>     > current IntersectingIterator:
>     >
>     > 1- It's not really extensible (there are hooks for building doc IDs,
>     > but you still need the same `partition term: docId` key structure)
>     > 2- Its main strength is that it can do the merges of sorted lists of
>     > doc IDs based on equality expressions (ie, `author=="bob" and
>     > day=="20120627"`)
>     >
>     > Fortunately, the logic isn't very complicated for re-creating the
>     > merging stuff. Personally, I think it's easy enough to separate the
>     > logic of joining N streams of iterator results from the actual
>     > scanning. Unfortunately, this would be left up to you to do at the
>     > moment :)
>     >
>     > You could do range searches by consuming sets of values and sorting
>     > all of the docIds in that range by throwing them into a TreeSet.
>     > That would let you emit doc IDs in a globally sorted order for the
>     > given range of terms.
>
>     I understand everything above, I think.  Thanks for the prompt reply.
>
>     > This can get problematic if the range ends up being very large
>     > because your iterator stack may periodically be destroyed and
>     > rebuilt.
>
>     This particular statement confused me.  When you said TreeSet, you're
>     talking about a straight-forward in-memory collection from java.util
>     or similar, right?  Because I'm confused about which "iterator stack
>     may periodically be destroyed and rebuilt."  It sounds like we're
>     talking about some garbage collection specific to Accumulo.  Am I
>     missing something here?
>
>     -Sukant
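
For reference, the TreeSet idea quoted above could look roughly like the
sketch below. It is plain Java over a hypothetical in-memory index (a
sorted map from term to doc IDs), not Accumulo's key layout, and
`RangeDocIdSketch` and `docIdsForTermRange` are made-up names. Note that
the merged set is exactly the kind of in-memory state that has to be
rebuilt whenever the iterator stack is re-created.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class RangeDocIdSketch {

    /**
     * Collects the doc IDs of every term in [lowTerm, highTerm] into a TreeSet,
     * so they come out de-duplicated and in globally sorted order.
     */
    static SortedSet<String> docIdsForTermRange(NavigableMap<String, List<String>> index,
                                                String lowTerm, String highTerm) {
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> docIds : index.subMap(lowTerm, true, highTerm, true).values()) {
            merged.addAll(docIds);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical index: day term -> doc IDs seen on that day.
        NavigableMap<String, List<String>> index = new TreeMap<>();
        index.put("20120625", Arrays.asList("doc3", "doc1"));
        index.put("20120626", Arrays.asList("doc2"));
        index.put("20120627", Arrays.asList("doc1", "doc4"));
        index.put("20120628", Arrays.asList("doc9"));
        System.out.println(docIdsForTermRange(index, "20120625", "20120627"));
    }
}
```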