Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)

I thought I might chime in late, too. I think we're talking about the
same thing, with perhaps different encoding.

Yes. In the bit-interleaving scheme I mentioned, each group of 3 bits
in the hash corresponds to one level of an oct-tree (a "3D quadtree").
... and yes, there is a trick to picking the time scales right.
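
To make that concrete, here is a minimal sketch of the interleaving (my
illustration for this thread, not tested code; the names are made up):

    // Interleave the bits of x, y, and t so that each group of 3 bits
    // in the result picks one octant, i.e. one level of the oct-tree.
    // Inputs are assumed non-negative fixed-point values that fit in
    // 'levels' bits; levels <= 21 keeps the key inside a long.
    static long interleave3(int x, int y, int t, int levels) {
        long key = 0L;
        for (int i = levels - 1; i >= 0; i--) {
            key = (key << 3)
                | ((long) ((x >>> i) & 1) << 2)
                | ((long) ((y >>> i) & 1) << 1)
                |  (long) ((t >>> i) & 1);
        }
        return key;
    }

Truncating the key to its first 3k bits is exactly "stop at level k."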

-- Kurt
On 6/25/13 9:08 AM, Jamie Stephens wrote:
> Adam & Co.,
>
> Sorry to chime in late here.
>
> One of our projects has similar requirements: queries based on
> time-space constraints. (Tracking a particular entity through time and
> space is a different requirement.)
>
> We've used the following scheme with decent results.
>
> Our basic approach is to use a 3D quadtree based on lat, lon, and
> time.  Longitude and time are first transformed so that a quadtree key
> prefix represents (approximately) a cube.  Alternatively, roll your
> own quadtree algorithm to similar effect.  So some number of prefix
> bytes of a quadtree key represents an approximate time-space cube of
> dimensions 1km x 1km x 1day.  Pick your time unit.  Another variation:
> use a 3D geohash instead of a quadtree.
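>
> To make the transform concrete, a sketch under my own unit assumptions
> (the constants and epoch here are illustrative, not our real values):
>
>     // Map each dimension to a 21-bit fixed-point value so the same bit
>     // depth covers comparable extents; a real version would also correct
>     // longitude by cos(latitude). Wrap-around of the day count is ignored.
>     static int scaleLat(double lat) { return (int) ((lat +  90.0) / 180.0 * ((1 << 21) - 1)); }
>     static int scaleLon(double lon) { return (int) ((lon + 180.0) / 360.0 * ((1 << 21) - 1)); }
>     static int scaleDay(long epochMillis) {
>         return (int) ((epochMillis / 86_400_000L) & ((1 << 21) - 1)); // whole days, 21 bits
>     }
>
> Bit-interleaving the three scaled values then yields a key whose
> 3k-bit prefixes are the approximate cubes.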
>
> Then use the first N bytes of the key as the row ID and the remaining
> bytes for the column qualifier.  Rationale: Sometimes there is virtue
> in keeping points in a cube on the same tablet server.  (Or you might
> want to, say, use only spatial key prefixes as row IDs.  Lots of
> flavors to consider.)
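>
> A minimal sketch of that split with the Accumulo client API (the
> column family and method name are illustrative, not our schema):
>
>     import java.util.Arrays;
>     import org.apache.accumulo.core.data.Mutation;
>     import org.apache.accumulo.core.data.Value;
>     import org.apache.hadoop.io.Text;
>
>     // First N key bytes become the row ID, so a coarse cube stays on
>     // one tablet; the remaining bytes become the column qualifier.
>     static Mutation toMutation(byte[] key, int n, byte[] point) {
>         Mutation m = new Mutation(new Text(Arrays.copyOfRange(key, 0, n)));
>         m.put(new Text("loc"),                                  // illustrative family
>               new Text(Arrays.copyOfRange(key, n, key.length)), // key suffix
>               new Value(point));
>         return m;
>     }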
>
> Disadvantages: You have to pick N and the time unit up front.  N and
> the time unit are the basic index tuning parameters.  In our
> applications, setting those parameters isn't too hard because we
> understand the data and its uses pretty well.  However, as you've
> suggested, hotspots due to concentrations can still be a problem.  We
> try to turn up N to adjust.
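>
> (To show what N controls at read time: a space-time box turns into a
> set of row-prefix scans. A hedged sketch; the table name and setup are
> illustrative:)
>
>     import org.apache.accumulo.core.client.Connector;
>     import org.apache.accumulo.core.client.Scanner;
>     import org.apache.accumulo.core.client.TableNotFoundException;
>     import org.apache.accumulo.core.data.Range;
>     import org.apache.accumulo.core.security.Authorizations;
>     import org.apache.hadoop.io.Text;
>
>     // Scan all points whose row is one N-byte cube prefix; refine to
>     // the exact box client-side or with a server-side iterator.
>     static Scanner cubeScan(Connector conn, byte[] cubePrefix)
>             throws TableNotFoundException {
>         Scanner s = conn.createScanner("points", Authorizations.EMPTY);
>         s.setRange(Range.prefix(new Text(cubePrefix)));
>         return s;
>     }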
>
> Variation: Use the military grid reference system (MGRS) grid zone
> designator and square identifier as row ID and a quadtree-code
> numerical location for the column qualifier.  Etc.
>
> I'll see if I can get an example on github.
>
> --Jamie
>
>
> On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <[EMAIL PROTECTED]> wrote:
>
>     Adam,
>
>     Usually with geo-queries points of interest are pretty dense (as
>     you've stated is your case). The indexing typically used (geohash
>     or z-order) is efficient for points spread evenly across the
>     earth, which isn't the typical case (think population density).
>     One method I've heard (never actually tried myself) is to store
>     points as distances from known locations. You can then find points
>     close to each other by finding similar distances to 2 or 3 known
>     locations. The known locations can then be created and distributed
>     based on your expected point density allowing even dense areas to
>     be spread evenly across a cluster.
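>
>     A toy sketch of the idea (my own illustration, untested; anchors is
>     an array of {lat, lon} pairs you'd maintain yourself):
>
>         // Index a point by its distances, rounded to a cell size, from a
>         // few known anchor locations; nearby points share distance cells.
>         static String distanceCellKey(double lat, double lon,
>                                       double[][] anchors, double cellKm) {
>             StringBuilder sb = new StringBuilder();
>             for (double[] a : anchors) {
>                 long cell = Math.round(haversineKm(lat, lon, a[0], a[1]) / cellKm);
>                 sb.append(cell).append(':');
>             }
>             return sb.toString(); // e.g. "412:87:1630:" for three anchors
>         }
>
>         static double haversineKm(double lat1, double lon1,
>                                   double lat2, double lon2) {
>             double dLat = Math.toRadians(lat2 - lat1);
>             double dLon = Math.toRadians(lon2 - lon1);
>             double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
>                      + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
>                        * Math.sin(dLon / 2) * Math.sin(dLon / 2);
>             return 2 * 6371.0 * Math.asin(Math.sqrt(h));
>         }
>
>     Placing more anchors where the data is dense is what spreads the
>     hot regions across the cluster.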
>
>     There's plenty of math, table design, and query design work to get
>     it all working, but I think it's feasible.
>
>     Jim
>
>
>     On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <[EMAIL PROTECTED]> wrote:
>
>
>         To clarify: By 'geologic', I was referring to time-scale (like
>         100s of millions of years, with more detail near present,
>         suggesting a log scale).
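>
>         A sketch of such a bucketing (the base-2 scale and the 1-year
>         floor are arbitrary choices of mine):
>
>             // Log-scale age bucket: recent times spread across many fine
>             // buckets, deep time collapses into a few coarse ones.
>             static int logTimeBucket(double yearsBeforePresent) {
>                 return (int) Math.floor(
>                     Math.log(Math.max(yearsBeforePresent, 1.0)) / Math.log(2.0));
>             }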
>
>         Your use of id is surprising. Maybe I don't understand what
>         you're trying to do.
>         From what I was thinking, since you made reference to
>         time-series, no efficiency is gained through this id. If,
>         instead, the id were for a whole time series, and not each
>         individual point, then for each timestamp you would have X(id,
>         timestamp), Y(id, timestamp), and whatever else (id, timestamp)
>         already organized as time series. ... all with the same row id.
>         bithash+id, INDEX, id, ... - (query to get a list of IDs
>         intersecting your space-time region)
>         id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to
Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."