Accumulo, mail # user - Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


Iezzi, Adam [USA] 2013-06-18, 01:56
Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Jared Winick 2013-06-18, 04:36
Have you considered a "geohash" of all 3 dimensions together and using that
as the RowID? I have never implemented a geohash exactly, but I do know it
is possible to build a z-order curve on more than 2 dimensions, which may
be what you want, considering that it sounds like all your queries are in
3 dimensions.
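
To make the suggestion concrete, here is a minimal sketch of a 3-dimensional
z-order (Morton) encoding, with longitude, latitude, and time interleaved bit
by bit. The 21-bits-per-dimension width, the normalization ranges, and the
class name are illustrative assumptions, not anything from the thread or
prescribed by Accumulo.

    public final class ZOrder3D {

        private static final int BITS = 21; // 3 * 21 = 63 bits fits in one long

        // Map a value in [min, max] onto an unsigned BITS-bit integer.
        static long normalize(double v, double min, double max) {
            return (long) (((v - min) / (max - min)) * ((1L << BITS) - 1));
        }

        // Interleave the low BITS bits of x, y, t as ...t1 y1 x1 t0 y0 x0, so
        // points close in all three dimensions get numerically close codes.
        static long interleave(long x, long y, long t) {
            long z = 0L;
            for (int i = 0; i < BITS; i++) {
                z |= ((x >>> i) & 1L) << (3 * i)
                   | ((y >>> i) & 1L) << (3 * i + 1)
                   | ((t >>> i) & 1L) << (3 * i + 2);
            }
            return z;
        }

        public static void main(String[] args) {
            long x = normalize(-77.016, -180.0, 180.0);          // longitude
            long y = normalize(38.905, -90.0, 90.0);             // latitude
            long t = normalize(1371520000.0, 0.0, 4102444800.0); // epoch seconds
            // Zero-padded hex keeps lexicographic order equal to numeric order,
            // which is what a sorted RowID needs.
            System.out.printf("%016x%n", interleave(x, y, t));
        }
    }

A prefix of such a code would then play the role the geohash prefix plays
today, with time folded into the curve rather than appended after it.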
On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <[EMAIL PROTECTED]> wrote:

>  I’ve been asked by my client to store a dataset which contains a time
> series and geospatial coordinates (points) in Accumulo. At the moment, we
> have very dense data stored in Accumulo using the following table schema:
>
> Row ID:       <geohash>_<reverse timestamp>
> Family:       <id>
> Qualifier:    attribute
> Value:        <value>
>
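
For reference, a minimal sketch of how the RowID above can be assembled. The
zero-padded Long.MAX_VALUE - millis reverse timestamp is the usual trick for
making newer entries sort first; the geohash string is assumed to come from
whatever encoder is already in use, and the class and method names are made
up for illustration.

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public final class RowIds {

        // <geohash>_<reverse timestamp>; zero-padding to 19 digits keeps
        // lexicographic order consistent with numeric order.
        static String rowId(String geohash, long epochMillis) {
            return String.format("%s_%019d", geohash, Long.MAX_VALUE - epochMillis);
        }

        // One entry in the schema above: family = <id>, qualifier = attribute.
        static Mutation entry(String geohash, long epochMillis,
                              String id, String attribute, String value) {
            Mutation m = new Mutation(rowId(geohash, epochMillis));
            m.put(id, attribute, new Value(value.getBytes()));
            return m;
        }
    }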
> We are salting our RowIDs with a geohash to prevent hot spotting. When we
> query the data, we use a prefix scan (center tile and eight neighbors),
> then use an Iterator to filter out the outliers (points and time).
> Unfortunately, we’ve noticed some performance issues with this approach:
> the initial prefix scan brings back a ton of data, forcing the iterators
> to filter out a significant number of outliers. E.g. more than 50% is
> being filtered out, which seems inefficient to us. Unfortunately for us,
> our users will always query by space and time, making them equally
> important for each query. Because of the time series component to our
> data, we’re often bringing back a significant amount of data for each
> given point. Each point can have ten entries due to the time series,
> making our data set very, very dense.
>
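
Roughly, that query path looks like the following: one Range per geohash tile
(the center plus its eight neighbors) fanned out through a BatchScanner. The
table name, thread count, and the source of the nine tiles are assumptions,
and Range.prefix assumes an Accumulo 1.5 client.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.security.Authorizations;

    public final class GeoScan {

        // tiles: the center geohash and its eight neighbors, however computed.
        static BatchScanner scanNeighborhood(Connector conn, List<String> tiles)
                throws Exception {
            List<Range> ranges = new ArrayList<Range>();
            for (String tile : tiles) {
                // Every RowID starting with this tile, i.e. every timestamp
                // stored under it -- this is where the excess data comes from.
                ranges.add(Range.prefix(tile));
            }
            BatchScanner scanner =
                    conn.createBatchScanner("geo_events", Authorizations.EMPTY, 4);
            scanner.setRanges(ranges);
            // Time filtering still has to happen in iterators afterward, which
            // is the >50% waste described above.
            return scanner;
        }
    }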
> The following are some options we’re considering:
>
> 1. Salt a master table with an ID rather than the geohash
>    (<id>_<reverse timestamp>), and then create a spatial index table. If we
>    choose this option, I assume we would scan the index first, then use a
>    batch scanner with the IDs from the first query (sketched below, after
>    this list). Unfortunately, I still see us filtering out a significant
>    amount of data using this approach.
>
> 2. Keep the table design as is, and maybe add a RegExFilter via a custom
>    Iterator.
>
> 3. Do something completely different, such as use a Column Family and the
>    temporal aspect of the dataset together in some way.
>
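
A rough sketch of option 1’s two-phase lookup, to make its cost visible:
phase one scans a spatial index table for entity IDs under a geohash prefix,
phase two pulls those IDs from the master table. The table names and the
index layout (entity ID stored in the column qualifier) are assumptions for
illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public final class TwoPhaseQuery {

        static BatchScanner lookup(Connector conn, String geohashTile)
                throws Exception {
            // Phase 1: the index table maps geohash-prefixed rows to entity IDs.
            Scanner index = conn.createScanner("spatial_index", Authorizations.EMPTY);
            index.setRange(Range.prefix(geohashTile));

            List<Range> idRanges = new ArrayList<Range>();
            for (Entry<Key, Value> e : index) {
                // Assumed layout: the entity ID lives in the column qualifier.
                String id = e.getKey().getColumnQualifier().toString();
                idRanges.add(Range.prefix(id)); // all <id>_<reverse ts> rows
            }

            // Phase 2: pull the matching rows from the master table.
            BatchScanner master =
                    conn.createBatchScanner("master", Authorizations.EMPTY, 4);
            master.setRanges(idRanges);
            return master;
        }
    }

Even with the two scans, every <id>_<reverse timestamp> row for a matched ID
still comes back, so the time filtering remains, which matches the
reservation in the item above.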
> Any advice or guidance would be greatly appreciated.
>
> Thank you,
>
> Adam
Kurt Christensen 2013-06-19, 00:53
Eric Newton 2013-06-18, 02:13