Accumulo user mailing list: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


Iezzi, Adam [USA] 2013-06-18, 01:56
Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Have you considered a "geohash" of all 3 dimensions together and using that
as the RowID? I have never implemented a geohash exactly, but I do know it
is possible to build a z-order curve on more than 2 dimensions, which may
be what you want, considering that it sounds like all your queries are in
three dimensions.
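The z-order suggestion above amounts to interleaving the bits of each dimension into a single key. A minimal sketch of a 3-D Morton (z-order) encoding follows; this is an illustration, not code from the thread, and the choice of 21 bits per dimension (to fit a 64-bit key) and the function names are assumptions:

```python
def part1by2(n: int) -> int:
    """Spread the low 21 bits of n so each occupies every third bit position."""
    n &= 0x1FFFFF                          # keep 21 bits per dimension (assumed)
    n = (n | (n << 32)) & 0x1F00000000FFFF
    n = (n | (n << 16)) & 0x1F0000FF0000FF
    n = (n | (n << 8))  & 0x100F00F00F00F00F
    n = (n | (n << 4))  & 0x10C30C30C30C30C3
    n = (n | (n << 2))  & 0x1249249249249249
    return n

def morton3d(x: int, y: int, t: int) -> int:
    """Interleave the bits of x, y, and t into one z-order value.

    Nearby (x, y, t) triples map to nearby keys, so the value could serve
    as a locality-preserving RowID component covering space and time.
    """
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(t) << 2)
```

A range scan over such keys covers a contiguous block of the 3-D curve, so the space/time filtering that a 2-D geohash pushes into iterators happens partly in the key layout itself.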
On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <[EMAIL PROTECTED]> wrote:

>  I’ve been asked by my client to store a dataset which contains a time
> series and geospatial coordinates (points) in Accumulo. At the moment, we
> have very dense data stored in Accumulo using the following table schema:
>
> Row ID:       <geohash>_<reverse timestamp>
> Family:       <id>
> Qualifier:    attribute
> Value:        <value>
>
> We are salting our RowIDs with a geohash to prevent hot spotting. When we
> query the data, we use a prefix scan (center tile and eight neighbors),
> then use an Iterator to filter out the outliers (points and time).
> Unfortunately, we’ve noticed some performance issues with this approach,
> in that the initial prefix scan brings back a ton of data, forcing the
> iterators to filter out a significant number of outliers. E.g. more than
> 50% is being filtered out, which seems inefficient to us. Unfortunately
> for us, our users will always query by space and time, making them equally
> important for each query. Because of the time series component of our
> data, we’re often bringing back a significant amount of data for each
> given point. Each point can have ten entries due to the time series,
> making our data set very dense.
>
> The following are some options we’re considering:
>
> 1. Salt a master table with an ID rather than the geohash
>    (<id>_<reverse timestamp>), and then create a spatial index table. If
>    we choose this option, I assume we would scan the index first, then use
>    a batch scanner with the IDs from the first query. Unfortunately, I
>    still see us filtering out a significant amount of data using this
>    approach.
> 2. Keep the table design as is, and maybe add a RegExFilter via a custom
>    Iterator.
> 3. Do something completely different, such as use a Column Family and the
>    temporal aspect of the dataset together in some way.
>
> Any advice or guidance would be greatly appreciated.
>
> Thank you,
>
> Adam
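For reference, the RowID scheme Adam describes (<geohash>_<reverse timestamp>) might be assembled along these lines. This is a minimal sketch, not code from the thread: the geohash string is assumed to come from a geohash library, and the 13-digit millisecond bound is an assumed constant:

```python
MAX_MILLIS = 10**13 - 1  # assumed bound; 13 digits covers dates far beyond 2200

def reverse_timestamp(ts_millis: int) -> str:
    """Subtract from a fixed maximum so newer entries sort first
    lexicographically, which is the usual reason for a reverse timestamp."""
    return str(MAX_MILLIS - ts_millis).zfill(13)

def make_row_id(geohash: str, ts_millis: int) -> str:
    """Build a RowID of the form <geohash>_<reverse timestamp>."""
    return f"{geohash}_{reverse_timestamp(ts_millis)}"
```

Because the geohash leads the key, rows spread across tablets by location (the "salting" Adam mentions), and within one geohash prefix the newest entries scan first.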
Kurt Christensen 2013-06-19, 00:53
Eric Newton 2013-06-18, 02:13