Accumulo >> mail # user >> more questions about IndexedDocIterators


Re: more questions about IndexedDocIterators
*SNIP

> > 3. Compressed reverse-timestamp using Unicode tricks?
> > ------------------------------------------------------
> >
> > I see code in Accumulo like
> >
> > // We're past the index column family, so return a term that will sort
> > // lexicographically last. The last unicode character should suffice
> > return new Text("\uFFFD");
> >
> > which gets me thinking that I can probably pull off an impressively
> > compressed, but still lexically ordered, reverse timestamp using Unicode
> > trickery to get a gigantic radix. Is there any precedent for this? I'm a
> > little worried about running into corner cases with Unicode encoding.
> > Otherwise, I think it feels like a simple algorithm that may not eat up
> > much CPU in translation and might save disk space at scale.
> >
> > Or is this optimizing into the noise given the compression Accumulo
> > already does under the covers?
>
> I would think the compression would take care of this.  If you try it and
> get an improvement, we'd be interested in seeing the results.
>
>
I think it is generally a good idea to use encoding techniques whenever
they're quick, effective, and easy. If you know something about your data
then you can usually do better than a general-purpose compression
algorithm. Slide 11 of my table design presentation (
http://people.apache.org/~afuchs/slides/accumulo_table_design.pdf) also
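shows a few extra tricks that might help you out. Another possibility is to
use a two's complement representation for a fixed precision number (e.g. a
long or an int), but flip the first (sign) bit, so that the big-endian bytes
sort lexicographically in the same order as the numbers themselves.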
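
As a rough illustration of the sign-bit trick, here is a minimal sketch in
plain Java. The class and method names (SortableLong, encode, encodeReverse,
compareBytes) are made up for this example and are not part of Accumulo; the
sketch assumes keys are compared byte by byte as unsigned values.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not an Accumulo API.
final class SortableLong {

    // Encode a signed long as 8 big-endian bytes with the sign bit flipped,
    // so unsigned lexicographic byte order matches numeric order.
    static byte[] encode(long value) {
        // ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(value ^ Long.MIN_VALUE)
                .array();
    }

    // Inverse of encode: read the big-endian long and flip the sign bit back.
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    // Reverse ordering: bitwise complement inverts numeric order without
    // overflow, so larger (newer) timestamps sort first.
    static byte[] encodeReverse(long value) {
        return encode(~value);
    }

    // Inverse of encodeReverse.
    static long decodeReverse(byte[] bytes) {
        return ~decode(bytes);
    }

    // Unsigned lexicographic comparison of two byte arrays; this sketch
    // assumes the sorted store orders keys this way.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Each value costs a fixed 8 bytes before any block compression, so whether
this beats a larger-radix text encoding on disk would have to be measured.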

Cheers,
Adam