Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Kurt Christensen 2013-06-25, 14:52
I thought I might chime in late, too. I think we're talking about the
same thing, with perhaps different encodings.
Yes. In the bit-interleaving scheme I mentioned, every 3 bits of the hash
is equivalent to a level in an octree (a "3D quadtree"). ... and yes,
there is a trick to picking the time scales right.
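To make the interleaving concrete, here is a minimal sketch (the function name and bit widths are mine, not from any library): normalize each dimension to an integer, then weave the bits so that every 3-bit group of the result names one octree cell.

```python
def interleave3(x: int, y: int, t: int, bits: int = 21) -> int:
    """Interleave the low `bits` bits of x, y, t into one Morton code.

    Bit i of each input lands at position 3*i (+0, +1, +2), so each
    3-bit group of the result selects one child of an octree node,
    and a shared code prefix means a shared space-time cell.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((t >> i) & 1) << (3 * i + 2)
    return code
```

Truncating the code to its high bits coarsens all three dimensions at once, which is what makes prefix range scans work.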
On 6/25/13 9:08 AM, Jamie Stephens wrote:
> Adam & Co.,
> Sorry to chime in late here.
> One of our projects has similar requirements: queries based on
> time-space constraints. (Tracking a particular entity through time and
> space is a different requirement.)
> We've used the following scheme with decent results.
> Our basic approach is to use a 3D quadtree based on lat, lon, and
> time. Longitude and time are first transformed to make a quadtree key
> prefix represent a cube (approximately). Alternatively, roll your
> own quadtree algorithm to get similar results. So some number of
> prefix bytes of a quadtree key represents an approximate time-space
> cube of dimensions 1km x 1km x 1day. Pick your time unit. Another
> variation: use a 3D geohash instead of a quadtree.
> Then use the first N bytes of the key as the row ID and the remaining
> bytes for the column qualifier. Rationale: Sometimes there is virtue
> in keeping points in a cube on the same tablet server. (Or you might
> want to, say, use only spatial key prefixes as row IDs. Lots of
> flavors to consider.)
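The row/column-qualifier split above can be sketched as follows. This is my own illustration under assumed bit widths (the names, 63-bit codes, and 24-bit row prefix are hypothetical, not from Jamie's implementation):

```python
def split_key(code: int, total_bits: int = 63, row_bits: int = 24):
    """Split one interleaved space-time code into (row, cq) hex strings.

    The high `row_bits` bits become the row ID, so all points in the
    same coarse space-time cube land on the same tablet server; the
    remaining bits go to the column qualifier to keep full precision.
    """
    row = code >> (total_bits - row_bits)
    cq = code & ((1 << (total_bits - row_bits)) - 1)
    return f"{row:06x}", f"{cq:010x}"
```

A query for one coarse cube then becomes a single-row scan, with the column qualifier available for finer filtering via iterators.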
> Disadvantages: You have to pick N and the time unit up front. N and
> the time unit are the basic index tuning parameters. In our
> applications, setting those parameters isn't too hard because we
> understand the data and its uses pretty well. However, as you've
> suggested, hotspots due to concentrations can still be a problem. We
> try to turn up N to adjust.
> Variation: Use the military grid reference system (MGRS) grid zone
> designator and square identifier as row ID and a quadtree-code
> numerical location for the column qualifier. Etc.
> I'll see if I can get an example on github.
> On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <[EMAIL PROTECTED]> wrote:
> Usually with geo-queries points of interest are pretty dense (as
> you've stated is your case). The indexing typically used (geohash
> or z-order) is efficient for points spread evenly across the
> earth, which isn't the typical case (think population density).
> One method I've heard (never actually tried myself) is to store
> points as distances from known locations. You can then find points
> close to each other by finding similar distances to 2 or 3 known
> locations. The known locations can then be created and distributed
> based on your expected point density allowing even dense areas to
> be spread evenly across a cluster.
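The known-locations idea above might look something like this sketch (entirely my own illustration; the anchor points, bucket size, and haversine distance are assumptions, since Jim describes the method only in outline):

```python
import math

def distance_key(lat, lon, anchors, bucket_km=50.0):
    """Index a point by its bucketed great-circle distances to a few
    known anchor locations.

    Nearby points get identical or adjacent distance tuples, so a
    query can scan the buckets around its own tuple. Placing anchors
    by expected point density spreads dense areas across the cluster.
    """
    R = 6371.0  # mean Earth radius, km

    def haversine(a, b):
        la1, lo1, la2, lo2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((la2 - la1) / 2) ** 2
             + math.cos(la1) * math.cos(la2)
             * math.sin((lo2 - lo1) / 2) ** 2)
        return 2 * R * math.asin(math.sqrt(h))

    return tuple(int(haversine((lat, lon), a) // bucket_km)
                 for a in anchors)
```

With 2 or 3 anchors, the tuple pins a point down to a small lens-shaped region, which is the "similar distances" query Jim describes.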
> There's plenty of math, table design, and query design work to get
> it all working, but I think it's feasible.
> On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen
> <[EMAIL PROTECTED]> wrote:
> To clarify: By 'geologic', I was referring to time-scale (like
> 100s of millions of years, with more detail near present,
> suggesting a log scale).
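One way to get "more detail near present" on a geologic time scale is a log-scale bucket level; this is my own toy illustration of the idea, not anything from the thread:

```python
import math

def time_level(years_before_present: float) -> int:
    """Map an age to a coarse log-scale level: one level per decade
    of age (1-10 yr, 10-100 yr, ...), so recent times get fine
    resolution and deep geologic time gets coarse resolution.
    """
    return max(0, int(math.log10(max(years_before_present, 1.0))))
```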
> Your use of id is surprising. Maybe I don't understand what
> you're trying to do.
> From what I was thinking, since you made reference to
> time-series, no efficiency is gained through this id. If,
> instead, the id were for a whole time series and not each
> individual point, then for each timestamp you would have X(id,
> timestamp), Y(id, timestamp) and whatever else (id, timestamp)
> already organized as time series. ... all with the same row id.
> bithash+id, INDEX, id, ... - (query to get a list of IDs
> intersecting your space-time region)
> id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to
P.O. Box 811
Westminster, MD 21158-0811
"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."