Accumulo >> mail # user >> more questions about IndexedDocIterators


Re: more questions about IndexedDocIterators
1) The class hierarchy is a little convoluted, but there doesn't seem to be
anything necessarily broken about the
FamilyIntersectingIterator/IndexedDocIterator that would prevent it from
being backported from trunk to a 1.3.x branch. AFAIK the
SortedKeyValueIterator interface has remained unchanged between the initial
1.3 release up through our current trunk.

2) I'm a little confused as to what you mean by "sharding by document ID."
Does this mean that for any given key, the row portion is a document ID? As
far as reversing the timestamp, it seems reasonable if your queries are
primarily of the form "give me documents within the past X time units."

3) What's your timestamp? If it's just a milliseconds-since-epoch
timestamp, it's not unheard of to encode numeric values so that they sort
lexicographically in numeric order without simply padding with zeroes. The
Wikipedia example has a NumberNormalizer that uses commons-lang to do this.
As for hard numbers on performance in time and space, I don't have them.
I would imagine you will see a difference in space, and possibly in time if
deserializing the String is faster than what you're using now.
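
To make the idea concrete, here is a minimal sketch of one way a reverse
timestamp could be encoded so that plain lexicographic comparison returns the
newest entries first. It is not the NumberNormalizer from the Wikipedia
example; the class name ReverseTimestamp and the fixed-width hex layout are
assumptions made up for illustration. It uses only the JDK and assumes
non-negative millisecond timestamps.

```java
// Hypothetical helper, not part of Accumulo or the Wikipedia example.
public final class ReverseTimestamp {
    private ReverseTimestamp() {}

    /** Encodes a non-negative timestamp so that newer values sort first. */
    public static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative timestamp: " + millis);
        }
        // Subtracting from Long.MAX_VALUE reverses the ordering and keeps
        // the result non-negative, so it fits in 16 hex digits.
        long reversed = Long.MAX_VALUE - millis;
        // Fixed width keeps lexicographic order equal to numeric order.
        return String.format("%016x", reversed);
    }

    /** Recovers the original timestamp from an encoded value. */
    public static long decode(String encoded) {
        return Long.MAX_VALUE - Long.parseLong(encoded, 16);
    }

    public static void main(String[] args) {
        String older = encode(1342393500000L);
        String newer = encode(1342393560000L);
        // The newer timestamp compares as lexicographically smaller.
        System.out.println(newer.compareTo(older) < 0);
        System.out.println(decode(newer) == 1342393560000L);
    }
}
```

Fixed-width hex is still a form of padding, but it takes 16 characters
where zero-padded decimal takes 19. Whether that saving matters for your
table is something you would have to measure.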

4) I'd like to see your source. Have you looked at the
IndexedDocIteratorTest to verify that it behaves properly? I'm surprised
that it's returning you an index column family. Was your sample client
running with the dummy negation you mentioned in #5?

On Sun, Jul 15, 2012 at 7:05 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I have a mixed bag of questions to follow up on an earlier post inquiring
> about intersecting iterators now that I've done some prototyping:
>
>
> 1. Do FamilyIntersectingIterators work in 1.3.4?
> ------------------------------------------------
>
> Does anyone know if FamilyIntersectingIterators were useable as far back as
> 1.3.4?  Or am I wasting my time on them at this old version (and need to
> upgrade)?
>
> I got a prototype of IndexedDocIterators working with Accumulo 1.4.1, but
> currently have a hung thread in my attempt to use a
> FamilyIntersectingIterator with Cloudbase 1.3.4.  Also, I noticed the API
> changed somewhat to remove some oddly designed static configuration.
>
> If FamilyIntersectingIterators were buggy, were there sufficient
> work-arounds to get some use out of them in 1.3.4?
>
> Unfortunately, I need to jump through some political/social hoops to
> upgrade, but if it's got to be done, then I'll do what I have to.
>
>
> 2. Is this approach reasonable?
> -------------------------------
>
> We're trying to be clever with our use of indexed docs.  We're less
> interested in searching over a large corpus of data in parallel, and more
> interested in doing some server-side joins in a data-local way (to reduce
> client burden and network traffic).  So we're heavily "sharding" our
> documents (billions of shards) and using range constraints on the iterator
> to home in on exactly one shard (new Range(shardId, shardId)).
>
> Let me give you a sense for what we're doing.  In one use case, we're using
> document-indexed iterators to accommodate both per-author and by-time
> accesses of a per-document commit log.  So we're sharding by document ID
> (and we have billions of documents).  Then we use the author ID as terms
> for each commit (one term per commit entry).  We use a reverse timestamp
> for the doc type, so we get back these entries in reverse time order.  In
> this way, we can scan the log for the entire document by time with plain
> iterators, and for a specific author with a document-indexed iterator
> (with a server-side join to the commit log entry).  Later on, we may index
> the log by other features with this approach.
>
> Is this strategy sane?  Is there precedent for doing it?  Is there a better
> alternative?
>
>
> 3. Compressed reverse-timestamp using Unicode tricks?
> ------------------------------------------------------
>
> I see code in Accumulo like
>
>     // We're past the index column family, so return a term that will sort