Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Adam & Co.,

Sorry to chime in late here.

One of our projects has similar requirements: queries based on time-space
constraints. (Tracking a particular entity through time and space is a
different requirement.)

We've used the following scheme with decent results.

Our basic approach is to use a 3D quadtree based on lat, lon, and time.
Longitude and time are first transformed so that a quadtree key prefix
represents an approximate cube.  Alternatively, roll your own quadtree
algorithm to give similar results.  Either way, some number of prefix
bytes of a quadtree key represents an approximate time-space cube of
dimensions 1 km x 1 km x 1 day.  Pick your time unit.  Another variation:
use a 3D geohash instead of a quadtree.
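
A minimal sketch of how such a key could be built, assuming plain bit
interleaving of the three dimensions (in effect an octree, one octal digit
per level).  The function name, the linear normalization, and the t_min and
t_max bounds are illustrative assumptions, not the scheme from this thread.
In particular, the transform that makes each cell roughly cubic is not
shown here.

```python
def spacetime_key(lat, lon, t, t_min, t_max, levels=12):
    """Return one octal digit per tree level for a (lat, lon, time) point.

    Hypothetical helper: t, t_min and t_max share any one time unit, and
    all three upper bounds are exclusive (lat < 90, lon < 180, t < t_max).
    """
    # Normalize each dimension to a fraction in [0, 1).
    fracs = [
        (lat + 90.0) / 180.0,
        (lon + 180.0) / 360.0,
        (t - t_min) / (t_max - t_min),
    ]
    if not all(0.0 <= f < 1.0 for f in fracs):
        raise ValueError("point lies outside the indexed lat/lon/time range")

    digits = []
    for _ in range(levels):
        digit = 0
        for i, f in enumerate(fracs):
            f *= 2.0
            bit = int(f)          # 1 if the point is in the upper half
            fracs[i] = f - bit    # keep the remainder for the next level
            digit = (digit << 1) | bit
        digits.append(str(digit))  # digit is in 0..7
    return "".join(digits)


print(spacetime_key(0.0, 0.0, 0, 0, 100, levels=4))     # 6000
print(spacetime_key(45.0, 90.0, 75, 0, 100, levels=3))  # 770
```

Points that are close in all three dimensions usually share a long key
prefix, which is what lets a prefix stand for one time-space cell.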

Then use the first N bytes of the key as the row ID and the remaining bytes
for the column qualifier.  Rationale: Sometimes there is virtue in keeping
points in a cube on the same tablet server.  (Or you might want to, say,
use only spatial key prefixes as row IDs.  Lots of flavors to consider.)
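
The split described above can be sketched as follows, assuming keys are
byte strings of equal length.  The helper names and the sample keys are
made up for illustration, and writing the result to Accumulo is not shown.

```python
def split_key(key: bytes, n: int):
    """Split an index key into (row ID, column qualifier) at byte n."""
    if not 0 < n <= len(key):
        raise ValueError("n must be between 1 and len(key)")
    return key[:n], key[n:]


def group_by_row(keys, n):
    """Map each row ID to the column qualifiers that would be stored in it."""
    rows = {}
    for key in keys:
        row_id, qualifier = split_key(key, n)
        rows.setdefault(row_id, []).append(qualifier)
    return rows


# Hypothetical keys: the first two fall in the same cell when n = 4.
keys = [b"60417352", b"60417350", b"60421111"]
rows = group_by_row(keys, n=4)
for row_id, qualifiers in rows.items():
    print(row_id, qualifiers)
```

With n = 4 the first two keys land in the same row, so a scan over that
row ID returns both points.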

Disadvantages: You have to pick N and the time unit up front.  N and the
time unit are the basic index tuning parameters.  In our applications,
setting those parameters isn't too hard because we understand the data and
its uses pretty well.  However, as you've suggested, hotspots due to
concentrations can still be a problem.  We try to turn up N to adjust.
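
One rough way to see the effect of N on hotspots, assuming a sample of
keys is available offline: count how many entries share each N-byte
prefix and look at the largest share.  The function name and the sample
keys are hypothetical.

```python
from collections import Counter


def max_row_share(keys, n):
    """Fraction of all entries that fall in the single busiest row."""
    counts = Counter(key[:n] for key in keys)
    return max(counts.values()) / len(keys)


# Hypothetical sample: three of four points cluster under prefix "6".
sample = ["6000", "6001", "6010", "7000"]
for n in (1, 3, 4):
    print(n, max_row_share(sample, n))
```

In this sample the busiest row holds 75% of the entries at N = 1, 50% at
N = 3, and 25% at N = 4, which is the sense in which turning up N spreads
a concentration over more rows.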

Variation: Use the Military Grid Reference System (MGRS) grid zone
designator and square identifier as the row ID and a quadtree-code
numerical location for the column qualifier.  Etc.

I'll see if I can get an example on GitHub.

On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <[EMAIL PROTECTED]> wrote:

> Adam,
> Usually with geo-queries points of interest are pretty dense (as you've
> stated is your case). The indexing typically used (geohash or z-order) is
> efficient for points spread evenly across the earth, which isn't the
> typical case (think population density). One method I've heard (never
> actually tried myself) is to store points as distances from known
> locations. You can then find points close to each other by finding similar
> distances to 2 or 3 known locations. The known locations can then be
> created and distributed based on your expected point density allowing even
> dense areas to be spread evenly across a cluster.
> There's plenty of math, table design, and query design work to get it all
> working, but I think it's feasible.
> Jim
> On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <[EMAIL PROTECTED]> wrote:
>> To clarify: By 'geologic', I was referring to time-scale (like 100s of
>> millions of years, with more detail near present, suggesting a log scale).
>> Your use of id is surprising. Maybe I don't understand what you're trying
>> to do.
>> From what I was thinking, since you made reference to time-series, no
>> efficiency is gained through this id. If, instead, the id were for a whole
>> time-series, and not each individual point, then for each timestamp, you
>> would have X(id, timestamp), Y(id, timestamp) and whatever else (id,
>> timestamp) already organized as time series. ... all with the same row id.
>>
>> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your space-time region)
>> id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to filter these points)
>> id, MEAS, name, vis, TIMESTAMP, named_measurement
>>
>> Alternately, if you wanted rich points, and not individual values:
>>
>> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your space-time region)
>> id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all in one column)
>>
>> If this is way off base from what you are trying to do, please ignore.
>> Kurt
>> -----
>> On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
>>> All,
>>> Thank you for all of the replies. To answer some of the questions:
>>> Q: You say you have point data. Are time series geographically fixed,
>>> with only the time dimension changing? ... or are the time series moving in