Accumulo >> mail # user >> TimeSpan Iterator


Re: TimeSpan Iterator
On Tue, Aug 28, 2012 at 9:51 AM, <[EMAIL PROTECTED]> wrote:

> Billie
>
> Your comment “Users should be aware that this is not an efficient
> operation, though.” may help me decide if my current use of a secondary
> time index is better then.  Right now I maintain a table that has
> timestamps as the rowid whose values are the rowid in a metadata table.
> Therefore I do one range scan based on the timestamp.  Then a second lookup
> of the metadata rowid.  Is this more efficient?
>

It probably depends on what percentage of the data you're bringing back, as
compared to the amount you're scanning over (if that's not the whole
table).  I would hypothesize if you're bringing more than N% of the data
back, you might as well just use the TimestampFilter on the main table.  If
you're bringing a smaller percentage back, it could be better to reduce the
amount of the main table you have to scan over by maintaining a secondary
time index.  I'm not sure what N would be.  You should also make sure that
the secondary index is actually reducing the amount of the main table
you're scanning over, e.g. if each rowid had a full range of timestamps,
you could be pulling a list of all rowids back from the index table and not
reducing the scan over the main table.
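
The trade-off above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Accumulo code: the TreeMaps stand in for sorted tables, each rowid is assumed to have exactly one timestamp, and the class, field, and method names (TimeIndexSketch, mainTable, indexTable, scanWithFilter, lookupViaIndex) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model: TreeMaps stand in for sorted tables (not the Accumulo API). */
final class TimeIndexSketch {
    // Main table: rowid -> timestamp (assumes one timestamp per rowid).
    final NavigableMap<String, Long> mainTable = new TreeMap<>();
    // Secondary time index: timestamp -> rowids written at that time.
    final NavigableMap<Long, List<String>> indexTable = new TreeMap<>();
    // Number of main-table entries touched by the last query.
    long mainEntriesRead = 0;

    void put(String rowId, long ts) {
        mainTable.put(rowId, ts);
        indexTable.computeIfAbsent(ts, k -> new ArrayList<>()).add(rowId);
    }

    /** Approach A: scan the whole main table and filter on timestamp. */
    List<String> scanWithFilter(long start, long end) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Long> e : mainTable.entrySet()) {
            mainEntriesRead++;
            if (e.getValue() >= start && e.getValue() <= end) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** Approach B: range scan on the index, then one main-table lookup per rowid. */
    List<String> lookupViaIndex(long start, long end) {
        List<String> hits = new ArrayList<>();
        for (List<String> rowIds : indexTable.subMap(start, true, end, true).values()) {
            for (String rowId : rowIds) {
                mainEntriesRead++;
                if (mainTable.containsKey(rowId)) {
                    hits.add(rowId);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TimeIndexSketch t = new TimeIndexSketch();
        for (int i = 0; i < 1000; i++) {
            t.put(String.format("row%04d", i), i);
        }
        t.scanWithFilter(10, 19);
        long filterReads = t.mainEntriesRead;
        t.mainEntriesRead = 0;
        t.lookupViaIndex(10, 19);
        System.out.println("filter: " + filterReads + ", index: " + t.mainEntriesRead);
    }
}
```

The counter treats a sequential read and a point lookup as equal cost, which a real table would not, so the break-even percentage N still has to be measured on real data. If the query range covered most of the timestamps, the index path would touch nearly every main-table row anyway.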

Also, the TimestampFilter is not optimized.  Filters evaluate each
key/value pair to see if it is accepted (in this case, if it is in a
timestamp range).  If there are a lot of timestamps for each cell (keys
that are identical except for timestamp), it would be better to use seeking
instead.  That would involve writing a new iterator.  If there aren't many
timestamps for each cell, seeking won't help and the TimestampFilter will
be fine.
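
To make the filter-versus-seek difference concrete, here is a rough sketch that counts how many keys each strategy examines. It is a simulation, not an Accumulo iterator: a TreeSet stands in for the sorted key space, a key is reduced to a cell name plus a timestamp, and all names (TimestampSeekSketch, filterAll, seekPerCell) are hypothetical. It relies on Accumulo's key ordering, in which timestamps sort newest first within a cell.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Simulation of filtering vs. seeking over sorted keys (not the Accumulo API). */
final class TimestampSeekSketch {
    /** A key reduced to a cell name plus a timestamp (hypothetical simplification). */
    static final class Key {
        final String cell;
        final long ts;
        Key(String cell, long ts) { this.cell = cell; this.ts = ts; }
    }

    // Cells ascending; within a cell, timestamps descending (newest first).
    static final Comparator<Key> ORDER = (a, b) -> {
        int c = a.cell.compareTo(b.cell);
        return c != 0 ? c : Long.compare(b.ts, a.ts);
    };

    final TreeSet<Key> keys = new TreeSet<>(ORDER);
    long examined = 0; // keys looked at by the last query

    /** Filter strategy: test every key against the timestamp range. */
    List<Key> filterAll(long start, long end) {
        List<Key> out = new ArrayList<>();
        for (Key k : keys) {
            examined++;
            if (k.ts >= start && k.ts <= end) out.add(k);
        }
        return out;
    }

    /** Seek strategy: skip the too-new and too-old parts of each cell. */
    List<Key> seekPerCell(long start, long end) {
        List<Key> out = new ArrayList<>();
        Key k = keys.isEmpty() ? null : keys.first();
        while (k != null) {
            examined++;
            if (k.ts > end) {
                // Too new: seek to the first key at or after (cell, end).
                k = keys.ceiling(new Key(k.cell, end));
            } else if (k.ts >= start) {
                out.add(k);
                k = keys.higher(k);
            } else {
                // Too old, and the rest of this cell is older: seek to the next cell.
                k = keys.higher(new Key(k.cell, Long.MIN_VALUE));
            }
        }
        return out;
    }
}
```

With three cells of 1000 timestamps each and a five-timestamp range, the filter examines all 3000 keys while the seeking version examines only a handful per cell. With one or two timestamps per cell, both examine roughly the same number of keys, and in a real system each seek is likely to cost more than a sequential read, which is why seeking only pays off when cells have many versions.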

Billie

> *From:* Billie Rinaldi [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, August 28, 2012 11:46
> *To:* [EMAIL PROTECTED]; [EMAIL PROTECTED]
> *Subject:* Re: TimeSpan Iterator
>
> On Tue, Aug 28, 2012 at 6:33 AM, John Armstrong <[EMAIL PROTECTED]> wrote:
>
> On 08/28/2012 09:26 AM, [EMAIL PROTECTED] wrote:
>
> Does anyone know of a TimeSpan Iterator that will fetch rows based on
> the accumulo timestamp?
>
> We actually wrote our own TimestampRangeIterator and TimestampSetIterator
> classes.  I don't know if 1.4 has any in the core libraries.  It's not very
> hard though.
>
> There's a TimestampFilter in org.apache.accumulo.core.iterators.user in
> 1.4.  It uses a range of timestamps.  Users should be aware that this is
> not an efficient operation, though.
>
> Billie
>