Accumulo >> mail # user >> RE: EXTERNAL: Re: Custom Iterators


Josh Elser 2012-08-22, 23:44
Re: EXTERNAL: Re: Custom Iterators
... and I just realized I was looking at the OrIterator in trunk, not
contrib/wikisearch x.x

Still, I think most of my comments apply. They should be verified with
test cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator
> to maintaining multiple pointers at different offsets to a file on
> disk. The clone()'ed iterators are all operating over the same row;
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and
> clone()'s the source it was given for each term (note the addTerm()
> method on the class). The OrIterator attempts to find the index
> entries for each term, and return the minimum docid to satisfy the
> SortedKeyValueIterator contract.
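
The union described above can be sketched in plain Java without the Accumulo API. This is only an illustration under stated assumptions: TermCursor and OrSketch are hypothetical names, each cursor stands in for one clone()'d source holding its own offset into the same sorted entries, and dropping duplicate docids is a simplification of this sketch rather than a statement about what the real OrIterator does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for one clone()'d source: it shares the sorted
// entries with the other cursors but keeps its own offset.
class TermCursor implements Comparable<TermCursor> {
    private final List<String[]> entries; // {term, docId}, sorted by term then docId
    private final String term;
    private int pos;

    TermCursor(List<String[]> entries, String term) {
        this.entries = entries;
        this.term = term;
        // "seek": skip ahead to the first entry for this term
        while (pos < entries.size() && !entries.get(pos)[0].equals(term)) {
            pos++;
        }
    }

    boolean hasTop() { return pos < entries.size() && entries.get(pos)[0].equals(term); }
    String topDocId() { return entries.get(pos)[1]; }
    void next() { pos++; }

    // Order cursors by their current docId so the queue head is the minimum.
    @Override
    public int compareTo(TermCursor other) { return topDocId().compareTo(other.topDocId()); }
}

class OrSketch {
    // Returns the docIds matching any of the terms, in sorted order.
    static List<String> union(List<String[]> entries, String... terms) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>();
        for (String term : terms) {
            TermCursor cursor = new TermCursor(entries, term);
            if (cursor.hasTop()) {
                queue.add(cursor);
            }
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor cursor = queue.poll(); // cursor with the minimum docId
            String docId = cursor.topDocId();
            if (out.isEmpty() || !out.get(out.size() - 1).equals(docId)) {
                out.add(docId); // skip a docId already emitted for another term
            }
            cursor.next();
            if (cursor.hasTop()) {
                queue.add(cursor); // re-queue only while the cursor still has entries
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> index = Arrays.asList(
            new String[] {"apple", "doc1"}, new String[] {"apple", "doc3"},
            new String[] {"pear", "doc2"}, new String[] {"pear", "doc3"});
        System.out.println(union(index, "apple", "pear")); // prints [doc1, doc2, doc3]
    }
}
```

Each cursor is only mutated after it has been polled from the queue, so the queue ordering stays valid.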
>
> Given your comment on the TermSource.compareTo() method's comment
> (....), yes, it does appear that you have found a bug. That comment
> about "multiple rows in a tablet" should really be removed, IMO. It's
> rather confusing, and shouldn't matter when you're writing an
> iterator. In other words, you, as a developer, don't need to know what
> rows are contained in a tablet. The only issue you need to worry about
> is if you're trying to do some operation *across* rows. Given that all
> of the index entries for a single document are contained in one row
> (which happens to just be a bucket in the Wiki application), this
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't
> check if the new topKey for the term it just advanced is contained in
> the current Range before adding it back to the PriorityQueue. This
> could cause a term that has passed outside of the initial Range
> provided to seek() to be added unnecessarily to said PriorityQueue.
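
The missing check can be illustrated with a small, self-contained Java sketch. It is not the OrIterator code: BoundedCursor and RangeCheckSketch are hypothetical names, keys are plain integers, and the Range is assumed to be a closed interval [start, end]. The point is only that a cursor is tested against the range after it advances and before it goes back into the PriorityQueue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical cursor over one sorted list of integer keys.
class BoundedCursor implements Comparable<BoundedCursor> {
    private final Iterator<Integer> it;
    private Integer top;

    BoundedCursor(List<Integer> sortedKeys) {
        this.it = sortedKeys.iterator();
        advance();
    }

    void advance() { top = it.hasNext() ? it.next() : null; }
    boolean hasTop() { return top != null; }
    int top() { return top; }

    @Override
    public int compareTo(BoundedCursor other) { return Integer.compare(top, other.top); }
}

class RangeCheckSketch {
    // True only if the cursor has a key and that key is not past the range end.
    private static boolean inRange(BoundedCursor cursor, int end) {
        return cursor.hasTop() && cursor.top() <= end;
    }

    // Merges all keys that fall inside [start, end] across the sources.
    static List<Integer> merge(int start, int end, List<List<Integer>> sources) {
        PriorityQueue<BoundedCursor> queue = new PriorityQueue<>();
        for (List<Integer> source : sources) {
            BoundedCursor cursor = new BoundedCursor(source);
            while (cursor.hasTop() && cursor.top() < start) {
                cursor.advance(); // "seek" to the start of the range
            }
            if (inRange(cursor, end)) {
                queue.add(cursor);
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            BoundedCursor cursor = queue.poll();
            out.add(cursor.top());
            cursor.advance();
            // The check discussed above: a cursor that has moved past the
            // range is dropped instead of being re-added to the queue.
            if (inRange(cursor, end)) {
                queue.add(cursor);
            }
        }
        return out;
    }
}
```

For example, merging the sources [1, 4, 9] and [3, 5, 8] over the range [3, 7] yields [3, 4, 5]; the cursors sitting at 9 and 8 never re-enter the queue.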
>
> +2 bugs
>
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I
> >> understand about Iterators (to be sure I'm not completely off my
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of
> >> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by
> >> moving the pointer to the first key/value that applies to that
>> iterator and is within the range
>>
>> a. Depending on the iterator, that may not be the first key in the range
>>
>> b. Only keys (and their corresponding values) which include one of
>> the column families listed in the family list should be available as
> >> topKey and topValue. (This restriction should continue until seek is
> >> called again, meaning that subsequent calls to next will only proceed
> >> to key/values that also match the list provided.)
>>
>> c. Generally speaking, a seek will result in the iterator calling
>> seek on its source iterator (although the parameters passed in may be
>> different)
>>
>> 3. If an iterator needs configuration beyond just the source obtained
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values
> >> as they consume. E.g., a Combiner may call next() and getTopValue()
> >> on its source multiple times each time those methods are called on
> >> it. And the value it returns as topKey may be a key that doesn't
> >> actually exist
>> in the datastore itself.
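
Point 4 can be illustrated with a minimal Java sketch that does not use the Accumulo API. SummingSketch is a hypothetical class, not Accumulo's Combiner: it wraps a sorted list of key/value pairs and, for each call to its own next(), reads as many source entries as share the current key, exposing their sum as a single top value.

```java
import java.util.List;
import java.util.Map;

// Hypothetical combiner-style wrapper: one next() call here may consume
// several entries from the underlying sorted source.
class SummingSketch {
    private final List<Map.Entry<String, Long>> source; // sorted by key
    private int pos;
    private String topKey;
    private long topValue;

    SummingSketch(List<Map.Entry<String, Long>> source) {
        this.source = source;
        next(); // position on the first combined entry
    }

    boolean hasTop() { return topKey != null; }
    String getTopKey() { return topKey; }
    long getTopValue() { return topValue; }

    void next() {
        if (pos >= source.size()) {
            topKey = null; // source exhausted
            return;
        }
        topKey = source.get(pos).getKey();
        topValue = 0;
        // Consume every source entry that shares the current key.
        while (pos < source.size() && source.get(pos).getKey().equals(topKey)) {
            topValue += source.get(pos).getValue();
            pos++;
        }
    }
}
```

Given the source entries a=1, a=2, b=5, the wrapper exposes a=3 and then b=5, so the value it returns is one that no single source entry holds.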
>>
>> So my questions:
>>
> >> Is it correct that once seek is called, only topKeys that conform to
> >> the columnFamilies collection should be returned, and that this
> >> behavior persists until seek is called again, even when next has been
> >> called?
>>
>> How do iterators like the OrIterator obtain multiple sources? (I
>> assume you were trying to address that with #3 in your response, but
> >> I don't understand what you mean by clone()ing the source. That would
> >> give me copies of the one source, but not multiple sources.)
>>
>> Why do some iterators have so many constructors if the system will