Accumulo >> mail # user >> RE: EXTERNAL: Re: Custom Iterators


Re: EXTERNAL: Re: Custom Iterators
... and I just realized I was looking at the OrIterator in trunk, not
contrib/wikisearch x.x

Still, I think most of my comments apply. I should verify them with test
cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator
> to maintaining multiple pointers at different offsets to a file on
> disk. The clone()'ed iterators are all operating over the same row;
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and
> clone()'s the source it was given for each term (note the addTerm()
> method on the class). The OrIterator attempts to find the index
> entries for each term, and return the minimum docid to satisfy the
> SortedKeyValueIterator contract.
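The union described above can be sketched in plain Java. This is an illustration only, not the wikisearch OrIterator: `UnionSketch`, `TermCursor`, and the integer doc ids are hypothetical names, and each cursor stands in for one clone()'d source that keeps its own offset into shared, sorted data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model of a union over cloned sources; not the Accumulo API. */
class UnionSketch {

    /** One "clone": its own offset into a sorted list of doc ids for a term. */
    static final class TermCursor implements Comparable<TermCursor> {
        final String term;
        final List<Integer> docIds; // sorted data (stands in for index entries in the row)
        int offset = 0;             // private position, like a separate file pointer

        TermCursor(String term, List<Integer> docIds) {
            this.term = term;
            this.docIds = docIds;
        }

        boolean hasTop() { return offset < docIds.size(); }
        int top() { return docIds.get(offset); }
        void next() { offset++; }

        @Override
        public int compareTo(TermCursor other) {
            return Integer.compare(top(), other.top());
        }
    }

    /** Returns the union of all cursors' doc ids in sorted order, without duplicates. */
    static List<Integer> union(List<TermCursor> cursors) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>();
        for (TermCursor c : cursors) {
            if (c.hasTop()) queue.add(c);
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor smallest = queue.poll(); // cursor holding the minimum doc id
            int docId = smallest.top();
            if (result.isEmpty() || result.get(result.size() - 1) != docId) {
                result.add(docId);
            }
            smallest.next();
            if (smallest.hasTop()) queue.add(smallest); // re-queue at its new position
        }
        return result;
    }
}
```

Each cursor advances independently, and the queue always surfaces the smallest doc id next, which is what keeps the output in sorted order.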
>
> Given your comment on the TermSource.compareTo() method's comment
> (....), yes, it does appear that you have found a bug. That comment
> about "multiple rows in a tablet" should really be removed, IMO. It's
> rather confusing, and shouldn't matter when you're writing an
> iterator. In other words, you, as a developer, don't need to know what
> rows are contained in a tablet. The only issue you need to worry about
> is if you're trying to do some operation *across* rows. Given that all
> of the index entries for a single document are contained in one row
> (which happens to just be a bucket in the Wiki application), this
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't
> check if the new topKey for the term it just advanced is contained in
> the current Range before adding it back to the PriorityQueue. This
> could cause a term that has passed outside of the initial Range
> provided to seek() to be added unnecessarily to said PriorityQueue.
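A rough sketch of the missing check, again in plain Java rather than against the Accumulo classes. `RangeCheckedUnion`, `Cursor`, the integer doc ids, and the inclusive integer range are all assumptions made for illustration; the point is only that a source is put back on the queue after advancing when its new top is still inside the range given to seek().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model of a range-checked union; names are hypothetical, not Accumulo classes. */
class RangeCheckedUnion {

    /** One per-term source with its own position in a sorted doc id array. */
    static final class Cursor implements Comparable<Cursor> {
        final int[] docIds;
        int offset;

        Cursor(int[] docIds) { this.docIds = docIds; }

        boolean hasTop() { return offset < docIds.length; }
        int top() { return docIds[offset]; }

        @Override
        public int compareTo(Cursor other) { return Integer.compare(top(), other.top()); }
    }

    private final List<Cursor> sources = new ArrayList<>();
    private final PriorityQueue<Cursor> queue = new PriorityQueue<>();
    private int endInclusive;

    RangeCheckedUnion(List<int[]> postings) {
        for (int[] p : postings) sources.add(new Cursor(p));
    }

    /** Positions every source at its first doc id >= start and queues those still <= end. */
    void seek(int start, int end) {
        endInclusive = end;
        queue.clear();
        for (Cursor c : sources) {
            c.offset = 0;
            while (c.hasTop() && c.top() < start) c.offset++;
            if (inRange(c)) queue.add(c);
        }
    }

    boolean hasTop() { return !queue.isEmpty(); }
    int top() { return queue.peek().top(); }
    int queuedSources() { return queue.size(); }

    /** Advances past the current doc id; sources that leave the range are dropped. */
    void next() {
        int current = top();
        while (!queue.isEmpty() && queue.peek().top() == current) {
            Cursor c = queue.poll();
            c.offset++;
            if (inRange(c)) queue.add(c); // the check discussed above
        }
    }

    private boolean inRange(Cursor c) { return c.hasTop() && c.top() <= endInclusive; }
}
```

Without the `inRange` guard in `next()`, a source whose next doc id lies beyond the range would sit in the queue and could surface as a top key that the caller never asked for.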
>
> +2 bugs
>
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I
>> understand about Iterators (to be sure I'm not completely off my
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of
>> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by
>> moving the pointer to the first key/value that applies to that
>> iterator and is within the range.
>>
>> a. Depending on the iterator, that may not be the first key in the range
>>
>> b. Only keys (and their corresponding values) which include one of
>> the column families listed in the family list should be available as
>> topKey and topValue. (This restriction should continue until seek is
>> called again, meaning that subsequent calls to next will only proceed
>> to key/values that also match the list provided.)
>>
>> c. Generally speaking, a seek will result in the iterator calling
>> seek on its source iterator (although the parameters passed in may be
>> different)
>>
>> 3. If an iterator needs configuration beyond just the source obtained
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values
>> as they consume, i.e., a Combiner may call next() and getTopValue
>> multiple times each time those methods are called on it. And the
>> value it returns as topKey may be a key that doesn't actually exist
>> in the datastore itself.
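One reading of points 1 through 3 can be sketched as follows. This is a deliberately simplified model, not Accumulo's SortedKeyValueIterator interface: `SimpleIterator`, `ListSource`, `FamilyFilter`, and `Entry` are hypothetical, rows are plain strings, and the range is an inclusive pair of rows. It shows an iterator receiving its source in init, passing seek down to that source, and exposing only the requested column families until seek is called again.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for the contract described above; not the Accumulo interface. */
class IteratorContractSketch {

    /** A key/value reduced to the parts this sketch needs. */
    static final class Entry {
        final String row, family, value;
        Entry(String row, String family, String value) {
            this.row = row; this.family = family; this.value = value;
        }
    }

    interface SimpleIterator {
        void init(SimpleIterator source, Map<String, String> options);
        void seek(String startRow, String endRow, Set<String> families);
        boolean hasTop();
        Entry top();
        void next();
    }

    /** Leaf source over an in-memory list that is already sorted by row. */
    static final class ListSource implements SimpleIterator {
        private final List<Entry> data;
        private int pos;
        private String endRow = "";

        ListSource(List<Entry> data) { this.data = data; }

        public void init(SimpleIterator source, Map<String, String> options) { } // nothing to wrap

        public void seek(String startRow, String endRow, Set<String> families) {
            this.endRow = endRow;
            pos = 0;
            while (pos < data.size() && data.get(pos).row.compareTo(startRow) < 0) pos++;
        }

        public boolean hasTop() {
            return pos < data.size() && data.get(pos).row.compareTo(endRow) <= 0;
        }
        public Entry top() { return data.get(pos); }
        public void next() { pos++; }
    }

    /** Wraps a source and exposes only entries in the families given to the last seek. */
    static final class FamilyFilter implements SimpleIterator {
        private SimpleIterator source;
        private Set<String> families;

        public void init(SimpleIterator source, Map<String, String> options) {
            this.source = source; // point 1: the source arrives through init
        }

        public void seek(String startRow, String endRow, Set<String> families) {
            this.families = families;               // remembered until the next seek
            source.seek(startRow, endRow, families); // point 2c: seek is passed down
            skipUnwanted();
        }

        public boolean hasTop() { return source.hasTop(); }
        public Entry top() { return source.top(); }

        public void next() {
            source.next();
            skipUnwanted(); // point 2b: the family restriction also applies after next
        }

        private void skipUnwanted() {
            while (source.hasTop() && !families.contains(source.top().family)) source.next();
        }
    }
}
```

In this model the first top entry after seek is not necessarily the first entry in the range (point 2a), because entries in other column families are skipped.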
>>
>> So my questions:
>>
>> Is it correct that once seek is called, only topKeys that conform to
>> the columnFamilies collection should be returned, and that this
>> behavior persists until seek is called again, even when next has been
>> called?
>>
>> How do iterators like the OrIterator obtain multiple sources? (I
>> assume you were trying to address that with #3 in your response, but
>> I don't understand what you mean by clone()ing the source. That would
>> give me copies of the one source, but not multiple sources)
>>
>> Why do some iterators have so many constructors if the system will