Accumulo, mail # user - RE: [External]  Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Jim Klucar 2013-06-24, 14:47
Adam,

Usually with geo-queries, points of interest are pretty dense (as you've
stated is your case). The indexing typically used (geohash or z-order) is
efficient for points spread evenly across the earth, which isn't the
typical case (think population density). One method I've heard (never
actually tried myself) is to store points as distances from known
locations. You can then find points close to each other by finding similar
distances to 2 or 3 known locations. The known locations can then be
created and distributed based on your expected point density allowing even
dense areas to be spread evenly across a cluster.
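
A minimal sketch of the distance-from-known-locations idea, not a tested design. The landmark names and coordinates, the 1 km bucket size, and the key format are all hypothetical choices made for illustration; the great-circle distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical reference locations; in practice these would be placed
# according to the expected point density.
LANDMARKS = {
    "nyc": (40.7128, -74.0060),
    "lon": (51.5074, -0.1278),
    "tok": (35.6762, 139.6503),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_keys(lat, lon, bucket_km=1.0):
    """Return one sortable key per landmark: '<landmark>_<zero-padded bucket>'."""
    keys = []
    for name, (rlat, rlon) in sorted(LANDMARKS.items()):
        bucket = int(haversine_km(lat, lon, rlat, rlon) // bucket_km)
        keys.append(f"{name}_{bucket:06d}")
    return keys
```

Points that are close together fall into the same or adjacent buckets for every landmark, so a query would scan a small range of buckets per landmark and intersect the results.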

There's plenty of math, table design, and query design work to get it all
working, but I think it's feasible.

Jim
On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <[EMAIL PROTECTED]> wrote:

>
> To clarify: By 'geologic', I was referring to time-scale (like 100s of
> millions of years, with more detail near present, suggesting a log scale).
>
> Your use of id is surprising. Maybe I don't understand what you're trying
> to do.
> From what I was thinking, since you made reference to time-series, no
> efficiency is gained through this id. If, instead, the id were for a whole
> time-series, and not each individual point, then for each timestamp, you
> would have X(id, timestamp), Y(id, timestamp) and whatever else (id,
> timestamp) already organized as time series. ... all with the same row id.
> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your
> space-time region)
> id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to filter these
> points)
> id, MEAS, name, vis, TIMESTAMP, named_measurement
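
The three-row layout above could be sketched as plain tuples, assuming the columns read as (row, column family, column qualifier, visibility, timestamp, value). The `Entry` type and the `series_entries` helper are hypothetical names for illustration only; this does not call any Accumulo API.

```python
from collections import namedtuple

# One key/value entry, mirroring the column order used in the layout above.
Entry = namedtuple("Entry", "row family qualifier visibility timestamp value")


def series_entries(series_id, bithash, ts, x, y, measurements, vis=""):
    """Build the index, position, and measurement entries for one sample."""
    entries = [
        # Index row: lets a space-time scan return the ids it intersects.
        Entry(f"{bithash}+{series_id}", "INDEX", series_id, vis, ts, ""),
        # Position row: all positions of one series share the same row id.
        Entry(series_id, "POSITION", "XY", vis, ts, f"{x},{y}"),
    ]
    # One entry per named measurement, in the same row as the positions.
    for name, value in sorted(measurements.items()):
        entries.append(Entry(series_id, "MEAS", name, vis, ts, str(value)))
    return entries
```

A query would first scan the index rows for a range of bithash values, collect the ids, and then read each id's row, filtering by timestamp.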
>
> Alternately, if you wanted rich points, and not individual values:
> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your
> space-time region)
> id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all in one
> column)
>
> If this is way off base from what you are trying to do, please ignore.
>
> Kurt
>
> -----
>
>
> On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
>
>> All,
>>
>> Thank you for all of the replies. To answer some of the questions:
>>
>> Q: You say you have point data. Are time series geographically fixed,
>> with only the time dimension changing? ... or are the time series moving in
>> space-time?
>> A: The time series will be moving in space-time; therefore, the dataset
>> is geologic.
>>
>> Q: If you have time series (identified by <id>) moving in space-time, then
>> I would add an indirection.
>> A: Our dataset is very similar to what you describe. Each geospatial
>> point and time stamp is defined by an id.  Since I'm new to the Accumulo
>> world, I'm not very familiar with this pattern/approach in table design.
>> But, I will look around now that I have some guidance.
>>
>> Overall, I think I need to create a space-time hash of my dataset, but
>> the biggest question I have is, "what time span do I use?". At the moment,
>> I only have a year's worth of data; therefore, my MIN_DATE = Jan 01 and
>> MAX_DATE = Dec 31. But we obviously expect this data to continue to grow;
>> therefore, would want to account for additional data in the future.
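
One possible reading of the space-time hash, sketched under assumptions: time is taken modulo a fixed span (as suggested elsewhere in this thread) and bit-interleaved with longitude and latitude. The one-year span, the 10 bits per dimension, and the period prefix that keeps later years distinct are illustrative choices, not something the thread prescribes.

```python
YEAR_SECONDS = 365 * 24 * 3600  # assumed time span


def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)


def interleave3(a, b, c, bits):
    """Interleave the bits of three integers, most significant bit first."""
    out = 0
    for i in reversed(range(bits)):
        out = (out << 1) | ((a >> i) & 1)
        out = (out << 1) | ((b >> i) & 1)
        out = (out << 1) | ((c >> i) & 1)
    return out


def space_time_key(lat, lon, epoch_seconds, bits=10, span=YEAR_SECONDS):
    """Return '<period>_<interleaved bits>' for one point in space-time."""
    period = epoch_seconds // span   # which span the timestamp falls in
    t = epoch_seconds % span         # position inside that span
    code = interleave3(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        quantize(t, 0, span, bits),
        bits,
    )
    return f"{period:04d}_{code:0{3 * bits}b}"
```

With this layout, data arriving after the first year lands under a new period prefix rather than forcing a change to MIN_DATE and MAX_DATE.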
>>
>> Thanks again for all of the guidance. I will digest some of the comments
>> and will report back.
>>
>> Adam
>>
>> -----Original Message-----
>> From: Kurt Christensen [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, June 18, 2013 8:54 PM
>> To: [EMAIL PROTECTED]
>> Subject: [External] Re: Storing, Indexing, and Querying data in Accumulo
>> (geo + timeseries)
>>
>>
>> An effective optimization strategy will be largely influenced by the
>> nature of your data.
>>
>> You say you have point data. Are time series geographically fixed, with
>> only the time dimension changing? ... or are the time series moving in
>> space-time?
>>
>> I was going to suggest a 3-D approach, bit-interleaving your space and
>> time [modulo timespan] together ( or point-tree, or octree, or k-d trie,
>> or r-d trie ). The trick there is to pick a time span large enough so that