Accumulo >> mail # user >> more questions about IndexedDocIterators

Re: more questions about IndexedDocIterators
Another point RE #1: You always have the option of adding iterators to an
already-installed instance. If you want to use the Accumulo version of the
iterators, you can backport those relatively easily and then stick them in
a jar in the lib/ext directory. The only trick is that you need to avoid
classname collisions or the built-in iterators will get loaded instead of
the ones in lib/ext. Just change the package names if that is a problem.

I'm also curious how the approach you described in #2 works. It seems
like it could work, but the trouble with having billions of
"shards" is that you might have to search through a large number of them
linearly if you can't narrow down the set of candidate shards enough from
the start. It also suggests that each of your billions of shards is
probably small enough that you don't need to worry about keeping a complex
index, and you could just evaluate the entire shard in-memory. However, I
could be totally wrong about the expected distribution. Maybe you can fill
in some more details?
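Point 2 of the quoted reply mentions reversing the timestamp so that recent documents sort first. A minimal sketch of that encoding (my own illustration, not code from the thread; the class and method names are hypothetical): subtracting from Long.MAX_VALUE and zero-padding to a fixed width makes newer timestamps sort earlier under plain lexicographic byte ordering.

```java
// Sketch of a "reversed timestamp" row/term encoding: newer entries
// sort first lexicographically, which suits "give me documents within
// the past X time units" scans.
public class ReverseTimestamp {
    // Long.MAX_VALUE has 19 decimal digits, so pad to width 19 to keep
    // all encoded values the same length (lexicographic == numeric).
    static String encode(long millisSinceEpoch) {
        return String.format("%019d", Long.MAX_VALUE - millisSinceEpoch);
    }

    public static void main(String[] args) {
        String older = encode(1_000L);
        String newer = encode(2_000L);
        // Newer timestamp encodes to the lexicographically smaller string.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

The fixed width matters: without padding, `"99" < "100"` lexicographically even though 99 < 100 numerically.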

On Mon, Jul 16, 2012 at 9:34 AM, William Slacum wrote:
> 1) The class hierarchy is a little convoluted, but there doesn't seem to
> be anything necessarily broken about the
> FamilyIntersectingIterator/IndexedDocIterator that would prevent it from
> being backported from trunk to a 1.3.x branch. AFAIK the
> SortedKeyValueIterator interface has remained unchanged between the initial
> 1.3 release up through our current trunk.
> 2) I'm a little confused as to what you mean by "sharding by document ID."
> Does this mean that for any given key, the row portion is a document ID? As
> far as reversing the timestamp, it seems reasonable if your queries are
> primarily of the form "give me documents within the past X time units."
> 3) What's your timestamp? If it's just a milliseconds-since-epoch
> timestamp, it's not unheard of to encode numeric values into an ordering
> that sorts lexicographically and isn't just zero-padding. The
> Wikipedia example has a NumberNormalizer that uses commons-lang to do this.
> As for hard numbers on performance with time and space, I don't have them.
> I would imagine you will see a difference in space, and possibly in time
> if deserializing the String is faster than what you're using now.
> 4) I'd like to see your source. Have you looked at the
> IndexedDocIteratorTest to verify that it behaves properly? I'm surprised
> that it's returning you an index column family. Was your sample client
> running with the dummy negation you mentioned in #5?
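As a side note on point 3: one common way to encode signed numbers so that lexicographic string order matches numeric order is to flip the sign bit and render fixed-width hex. This is a minimal sketch of the general technique, not the commons-lang NumberNormalizer mentioned above; the class and method names are my own.

```java
// Order-preserving encoding for signed longs: XOR with the sign bit
// maps the signed range onto an unsigned range, and fixed-width hex
// keeps every encoded value the same length, so plain string
// comparison agrees with numeric comparison.
public class LexLong {
    static String encode(long v) {
        // %x on a long prints the two's-complement bits as unsigned hex;
        // %016x zero-pads to 16 hex digits.
        return String.format("%016x", v ^ Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(encode(-5L).compareTo(encode(3L)) < 0);  // true
        System.out.println(encode(3L).compareTo(encode(400L)) < 0); // true
    }
}
```

Negative values encode below positives (e.g. -5 becomes `7fff...fffb`, 3 becomes `8000...0003`), so a scan over the encoded terms sees them in numeric order.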
> On Sun, Jul 15, 2012 at 7:05 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> I have a mixed bag of questions to follow up on an earlier post inquiring
>> about intersecting iterators, now that I've done some prototyping:
>> 1. Do FamilyIntersectingIterators work in 1.3.4?
>> ------------------------------------------------
>> Does anyone know if FamilyIntersectingIterators were usable as far back
>> as 1.3.4?  Or am I wasting my time on them at this old version (and need
>> to upgrade)?
>> I got a prototype of IndexedDocIterators working with Accumulo 1.4.1, but
>> currently have a hung thread in my attempt to use a
>> FamilyIntersectingIterator with Cloudbase 1.3.4.  Also, I noticed the API
>> changed somewhat to remove some oddly designed static configuration.
>> If FamilyIntersectingIterators were buggy, were there sufficient
>> work-arounds to get some use out of them in 1.3.4?
>> Unfortunately, I need to jump through some political/social hoops to
>> upgrade, but if it's got to be done, then I'll do what I have to.
>> 2. Is this approach reasonable?
>> -------------------------------
>> We're trying to be clever with our use of indexed docs.  We're less
>> interested in searching over a large corpus of data in parallel, and more
>> interested in doing some server-side joins in a data-local way (to reduce
>> client burden