Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Kurt Christensen 2013-06-19, 00:53
An effective optimization strategy will be largely influenced by the
nature of your data.
You say you have point data. Are time series geographically fixed, with
only the time dimension changing? ... or are the time series moving in
I was going to suggest a 3-D approach, bit-interleaving your space and
time [modulo timespan] together (or point-tree, or octree, or k-d
trie, or r-d trie). The trick there is to pick a time span large enough
so that any interval you query is small relative to the time span, but
small enough so that you don't waste a bunch (up to an eighth) of your
usable hash values with no useful time data (i.e. populate your most
significant bits). This would work if your data were geographically
fixed, but changing only in time. If your time span is geologic, you
might want to use a logarithmic time scale.
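
A minimal sketch of the bit-interleaving idea, in Python for brevity. The 21 bits per dimension, the `TIME_SPAN_S` value, the lon/lat bounds, and all function names are assumptions chosen for illustration; none of this is Accumulo API.

```python
BITS = 21                # assumed resolution per dimension (3 * 21 = 63-bit key)
TIME_SPAN_S = 2 ** 25    # assumed time span in seconds (roughly 388 days)


def quantize(value, lo, hi, bits=BITS):
    """Map value in [lo, hi) onto an integer cell in [0, 2**bits)."""
    if not lo <= value < hi:
        raise ValueError("value outside [lo, hi)")
    cell = int((value - lo) / (hi - lo) * (1 << bits))
    # Guard against float rounding pushing a value just below hi out of range.
    return min(cell, (1 << bits) - 1)


def interleave3(x, y, t, bits=BITS):
    """Interleave the bits of x, y and t into one z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((t >> i) & 1) << (3 * i)
    return key


def spacetime_key(lon, lat, epoch_s):
    """Hash a point in space-time; time is taken modulo the time span."""
    x = quantize(lon, -180.0, 180.0)
    y = quantize(lat, -90.0, 90.0)
    t = quantize(epoch_s % TIME_SPAN_S, 0, TIME_SPAN_S)
    return interleave3(x, y, t)


def row_id(lon, lat, epoch_s):
    """Fixed-width hex string, so lexicographic order matches numeric order."""
    return format(spacetime_key(lon, lat, epoch_s), "016x")
```

Cells that share leading bits of the key form a contiguous range of row IDs, so a space-time box can be covered by a handful of row ranges rather than one wide prefix scan. Because time is taken modulo the span, two timestamps exactly one span apart hash to the same key and still have to be told apart by the filter.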
If you have time series (identified by <id>) moving in space-time, then
I would add an indirection. Use the space-time hash to determine the IDs
intersecting your zone and then query again, using the IDs to pull out
the time series, filtering with your iterator, perhaps using the native
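
The indirection above can be sketched as a two-step lookup. Plain Python structures stand in for the two tables here: `index` is a sorted list of `(space-time key, series id)` pairs and `data` maps each series id to its points. Those layouts and both function names are hypothetical; in Accumulo the second step would more likely be a batch scan with one range per ID and the filtering done in an iterator on the server side.

```python
from bisect import bisect_left, bisect_right


def ids_in_ranges(index, key_ranges):
    """Step 1: collect the series IDs whose space-time keys fall in any range.

    index      -- sorted list of (spacetime_key, series_id) pairs (assumed layout)
    key_ranges -- list of inclusive (lo, hi) key ranges covering the query zone
    """
    keys = [k for k, _ in index]
    found = set()
    for lo, hi in key_ranges:
        start = bisect_left(keys, lo)
        stop = bisect_right(keys, hi)
        for _, series_id in index[start:stop]:
            found.add(series_id)
    return found


def fetch_series(data, ids, t_min, t_max, bbox):
    """Step 2: pull each series by ID and filter its points exactly.

    data -- dict of series_id -> list of (epoch_s, lon, lat, value) (assumed layout)
    bbox -- (west, south, east, north), inclusive
    """
    west, south, east, north = bbox
    result = {}
    for series_id in sorted(ids):
        hits = [
            p for p in data.get(series_id, [])
            if t_min <= p[0] <= t_max
            and west <= p[1] <= east
            and south <= p[2] <= north
        ]
        if hits:
            result[series_id] = hits
    return result
```

The first step only has to be precise enough to name candidate IDs; the exact space and time test is applied once, in the second step, against the series actually fetched.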
I hope that helps. Good luck.
BTW: 50% filtering isn't really that inefficient. - kkc
On 6/18/13 12:36 AM, Jared Winick wrote:
> Have you considered a "geohash" of all 3 dimensions together and using
> that as the RowID? I have never implemented a geohash exactly, but I
> do know it is possible to build a z-order curve on more than 2
> dimensions, which may be what you want considering that it sounds like
> all your queries are in 3-dimensions.
> On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <[EMAIL PROTECTED]> wrote:
> I've been asked by my client to store a dataset which contains a
> time series and geospatial coordinates (points) in Accumulo. At
> the moment, we have very dense data stored in Accumulo using the
> following table schema:
> Row ID: <geohash>_<reverse timestamp>
> Family: <id>
> Qualifier: attribute
> Value: <value>
> We are salting our RowIDs with a geohash to prevent hot spotting.
> When we query the data, we use a prefix scan (center tile and
> eight neighbors), then using an Iterator to filter out the
> outliers (points and time). Unfortunately, we've noticed some
> performance issues with this approach in that it seems the
> initial prefix scan brings back a ton of data, forcing the
> iterators to filter out a significant amount of outliers. E.g.
> more than 50% is being filtered out, which seems inefficient to
> us. Unfortunately for us, our users will always query by space and
> time, making them equally important for each query. Because of the
> time series component to our data, we're often bringing back a
> significant amount of data for each given point. Each point can
> have ten entries due to the time series, making our data set very
> very dense.
> The following are some options we're considering:
> 1. Salt a master table with an ID rather than the geohash
> <id>_<reverse timestamp>, and then create a spatial index table.
> If we choose this option, I assume we would scan the index first,
> then use a batch scanner with the ID from the first query.
> Unfortunately, I still see us filtering out a significant amount
> of data using this approach.
> 2. Keep the table design as is, and maybe a RegExFilter via a
> custom Iterator.
> 3. Do something completely different, such as use a Column Family
> and the temporal aspect of the dataset together in some way.
> Any advice or guidance would be greatly appreciated.
> Thank you,
P.O. Box 811
Westminster, MD 21158-0811
"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."