Accumulo, mail # user - RE: [External]  Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Kurt Christensen 2013-06-25, 14:52

I thought I might chime in late, too. I think we're talking about the
same thing, with perhaps different encoding.

Yes. In the bit-interleaving scheme I mentioned, each group of 3 bits of
the hash is equivalent to a level in an octree ("3D quadtree"). ... and yes,
there is a trick to picking the time-scales right.
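
A minimal sketch of the bit-interleaving idea, for illustration only. It
assumes x, y, and t have already been scaled to non-negative integers of
`bits` bits each; the function names are hypothetical and do not come from
any code posted in this thread.

```python
def interleave3(x, y, t, bits):
    """Interleave the low `bits` bits of x, y and t, most significant first.

    Each step appends one bit from each dimension, so every group of
    3 bits in the result selects one of 8 octants at one octree level.
    """
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 3) \
            | (((x >> i) & 1) << 2) \
            | (((y >> i) & 1) << 1) \
            | ((t >> i) & 1)
    return code


def octree_path(code, bits):
    """Split an interleaved code into one octant index (0-7) per level, root first."""
    return [(code >> (3 * i)) & 0b111 for i in range(bits - 1, -1, -1)]


# octree_path(interleave3(2, 1, 3, bits=2), bits=2) gives [5, 3]
```

Two points that fall in the same octant at some level share the same
leading 3-bit groups, which is what makes prefix scans over the hash work.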

-- Kurt
On 6/25/13 9:08 AM, Jamie Stephens wrote:
> Adam & Co.,
>
> Sorry to chime in late here.
>
> One of our projects has similar requirements: queries based on
> time-space constraints. (Tracking a particular entity through time and
> space is a different requirement.)
>
> We've used the following scheme with decent results.
>
> Our basic approach is to use a 3D quadtree based on lat, lon, and
> time.  Longitude and time are first transformed to make a quadtree key
> prefix represent a cube (approximately).  Alternatively, roll your
> own quadtree algorithm to give similar results.  So some number of
> prefix bytes of a quadtree key represents an approximate time-space
> cube of dimensions 1km x 1km x 1day.  Pick your time unit.  Another
> variation: use a 3D geohash instead of a quadtree.
>
> Then use the first N bytes of the key as the row ID and the remaining
> bytes for the column qualifier.  Rationale: Sometimes there is virtue
> in keeping points in a cube on the same tablet server.  (Or you might
> want to, say, use only spatial key prefixes as row IDs.  Lots of
> flavors to consider.)
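
The row ID / column qualifier split described above can be sketched as
follows. This is an assumption-laden illustration, not the poster's code:
it assumes the quadtree key is an integer encoded as fixed-width
big-endian bytes, and `encode_key` and `split_key` are hypothetical
helpers. No Accumulo API is called; the two returned byte strings would
be used as the row and column qualifier of an entry.

```python
def encode_key(code, width):
    """Encode an integer key as fixed-width big-endian bytes.

    Big-endian, fixed-width encoding makes lexicographic byte order
    match numeric order, so key prefixes stay adjacent when sorted.
    """
    return code.to_bytes(width, "big")


def split_key(key, n):
    """Return (row_id, column_qualifier): first n bytes and the remainder."""
    if not 0 < n <= len(key):
        raise ValueError("n must be between 1 and len(key)")
    return key[:n], key[n:]


row_id, col_qual = split_key(encode_key(0x0A0B0C0D, 4), 2)
# row_id is b"\x0a\x0b", col_qual is b"\x0c\x0d"
```

A larger n gives smaller cubes per row and spreads dense regions across
more rows, at the cost of touching more rows per query.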
>
> Disadvantages: You have to pick N and the time unit up front.  N and
> the time unit are the basic index tuning parameters.  In our
> applications, setting those parameters isn't too hard because we
> understand the data and its uses pretty well.  However, as you've
> suggested, hotspots due to concentrations can still be a problem.  We
> try to turn up N to adjust.
>
> Variation: Use the military grid reference system (MGRS) grid zone
> designator and square identifier as row ID and a quadtree-code
> numerical location for the column qualifier.  Etc.
>
> I'll see if I can get an example on github.
>
> --Jamie
>
>
> On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar wrote:
>
>     Adam,
>
>     Usually with geo-queries points of interest are pretty dense (as
>     you've stated is your case). The indexing typically used (geohash
>     or z-order) is efficient for points spread evenly across the
>     earth, which isn't the typical case (think population density).
>     One method I've heard (never actually tried myself) is to store
>     points as distances from known locations. You can then find points
>     close to each other by finding similar distances to 2 or 3 known
>     locations. The known locations can then be created and distributed
>     based on your expected point density allowing even dense areas to
>     be spread evenly across a cluster.
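
A rough sketch of the distance-from-known-locations idea, under several
assumptions not stated in the thread: great-circle (haversine) distance
on a spherical Earth of radius 6371 km, distances quantized into
fixed-width buckets, and hypothetical names throughout
(`haversine_km`, `distance_signature`, `bucket_km`).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is assumed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_signature(lat, lon, landmarks, bucket_km=1.0):
    """Quantized distance from a point to each landmark, one bucket index per landmark."""
    return tuple(
        int(haversine_km(lat, lon, la, lo) // bucket_km) for la, lo in landmarks
    )
```

Points closer together than `bucket_km` differ by at most one bucket in
each component, so a query would need to scan adjacent buckets as well.
Placing more landmarks in dense areas is what would spread the load.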
>
>     There's plenty of math, table design, and query design work to get
>     it all working, but I think it's feasible.
>
>     Jim
>
>
>     On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen wrote:
>
>
>         To clarify: By 'geologic', I was referring to time-scale (like
>         100s of millions of years, with more detail near present,
>         suggesting a log scale).
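
One way a log time scale could be quantized, as a sketch only. The
thread does not say how the scale was actually built; the function name,
the base-10 logarithm, and the `buckets_per_decade` parameter are all
assumptions made for illustration.

```python
import math


def log_time_bucket(years_before_present, buckets_per_decade=1):
    """Map an age in years to a bucket index on a log10 scale.

    Adding 1 before taking the log keeps age 0 (the present) in bucket 0.
    Recent times get narrow buckets and ancient times get wide ones.
    """
    if years_before_present < 0:
        raise ValueError("age must be non-negative")
    return int(math.log10(1 + years_before_present) * buckets_per_decade)
```

With one bucket per decade, ages of 0, 50, and 500 million years fall in
buckets 0, 1, and 8 respectively.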
>
>         Your use of id is surprising. Maybe I don't understand what
>         you're trying to do.
>         From what I was thinking, since you made reference to
>         time-series, no efficiency is gained through this id. If,
>         instead, the id were for a whole time-series, and not each
>         individual point, then for each timestamp you would have X(id,
>         timestamp), Y(id, timestamp) and whatever else (id, timestamp)
>         already organized as time series ... all with the same row id.
>         bithash+id, INDEX, id, ... - (query to get a list of IDs
>         intersecting your space-time region)
>         id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to
Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."