Accumulo >> mail # user >> more questions about IndexedDocIterators


Re: more questions about IndexedDocIterators
*SNIP

> > 3. Compressed reverse-timestamp using Unicode tricks?
> > ------------------------------------------------------
> >
> > I see code in Accumulo like
> >
> > // We're past the index column family, so return a term that will sort
> > // lexicographically last. The last unicode character should suffice
> > return new Text("\uFFFD");
> >
> > which gets me thinking that I can probably pull off an impressively
> > compressed, but still lexically ordered, reverse timestamp using
> > Unicode trickery to get a gigantic radix. Is there any precedent for
> > this? I'm a little worried about running into corner cases with
> > Unicode encoding. Otherwise, I think it feels like a simple algorithm
> > that may not eat up much CPU in translation and might save disk space
> > at scale.
> >
> > Or is this optimizing into the noise given the compression Accumulo
> > already does under the covers?
>
> I would think the compression would take care of this.  If you try it and
> get an improvement, we'd be interested in seeing the results.
>
>
I think it is generally a good idea to use encoding techniques whenever
they're quick, effective, and easy. If you know something about your data
then you can usually do better than a general-purpose compression
algorithm. Slide 11 of my table design presentation (
http://people.apache.org/~afuchs/slides/accumulo_table_design.pdf) also
shows a few extra tricks that might help you out. Another possibility is to
store a fixed-precision number (e.g. a long or an int) in its big-endian
two's complement representation, but flip the first (sign) bit so that
negative values sort before positive ones under byte-wise comparison.
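
A minimal sketch of the sign-bit trick, assuming keys are compared as
unsigned bytes in lexicographic order. The class and method names
(SortableLongEncoding, encode, encodeReverse, compareUnsigned) are
hypothetical and not part of Accumulo. For a reverse timestamp, the sketch
complements the value before encoding, which is one possible choice among
several:

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not an Accumulo API.
final class SortableLongEncoding {

    // Big-endian two's complement with the sign bit flipped, so that
    // unsigned byte-wise order matches signed numeric order.
    public static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(value ^ Long.MIN_VALUE)
                .array();
    }

    public static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    // Reverse order: ~t decreases as t increases and cannot overflow,
    // so newer timestamps sort before older ones.
    public static byte[] encodeReverse(long timestamp) {
        return encode(~timestamp);
    }

    public static long decodeReverse(byte[] bytes) {
        return ~decode(bytes);
    }

    // Lexicographic comparison treating each byte as unsigned (assumed
    // here to be how the sorted keys are compared).
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(Byte.toUnsignedInt(a[i]),
                                      Byte.toUnsignedInt(b[i]));
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        long older = 1_000L;
        long newer = 2_000L;
        // Negative values sort before positive ones: prints true
        System.out.println(compareUnsigned(encode(-1L), encode(1L)) < 0);
        // Newer reverse timestamps sort first: prints true
        System.out.println(
            compareUnsigned(encodeReverse(newer), encodeReverse(older)) < 0);
        // Round trip recovers the original value: prints true
        System.out.println(decodeReverse(encodeReverse(older)) == older);
    }
}
```

Each value always takes 8 bytes here, so this trades the compactness of a
large-radix text encoding for a fixed width and a very cheap translation.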

Cheers,
Adam