Accumulo, mail # user - Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Eric Newton 2013-06-18, 02:13
It may be that the 50% you are filtering out would need to be
seeked/scanned anyhow, because they belong in blocks close to the data you
want.

Have you experimented with smaller tiles?
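To make the tile-size question concrete, here is a rough back-of-the-envelope sketch. It assumes uniformly distributed points, square tiles, and a square query box, none of which come from the original post, and the function name and parameters are hypothetical. It estimates what fraction of the scanned area falls outside the query for a given tile size.

```python
import math

def wasted_fraction(query_side, tile_side, offset=0.0):
    """Estimate the fraction of scanned area lying outside a square query box.

    Assumptions (illustrative only): uniform point density, square tiles of
    side `tile_side`, and a query box whose lower edge sits at `offset`
    on each axis.
    """
    first = math.floor(offset / tile_side)
    last = math.ceil((offset + query_side) / tile_side) - 1
    tiles_per_axis = last - first + 1
    scanned_area = (tiles_per_axis * tile_side) ** 2
    return 1.0 - (query_side ** 2) / scanned_area

coarse = wasted_fraction(query_side=1.0, tile_side=1.0, offset=0.1)
fine = wasted_fraction(query_side=1.0, tile_side=0.25, offset=0.1)
print(f"coarse tiles: {coarse:.2f} wasted, fine tiles: {fine:.2f} wasted")
```

Under these assumptions the finer tiling discards much less, but it also needs more ranges per query (25 tiles instead of 4 here), so there is a trade-off between filtering work and the number of seeks.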

Do you know how your geohash is being mapped to nodes?  Does it spread well
over your cluster?  You may want to look at a custom balancer.
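Since rows are sorted by key, rows that share a geohash prefix sit next to each other, so a dense region can end up concentrated in a few tablets. A minimal sketch for checking this on a sample of row IDs follows. The sample rows and the `prefix_skew` helper are made up for illustration, and it does not model how tablets are actually assigned to servers, which depends on your split points and balancer.

```python
from collections import Counter

def prefix_skew(row_ids, prefix_len=2):
    """Count rows per leading geohash prefix and return (counts, max/mean ratio)."""
    counts = Counter(row_id[:prefix_len] for row_id in row_ids)
    mean = sum(counts.values()) / len(counts)
    return counts, max(counts.values()) / mean

# Hypothetical sample of row IDs in <geohash>_<reverse timestamp> form.
rows = ["dr5ru_1", "dr5rv_2", "dr72h_3", "9q8yy_4"]
counts, skew = prefix_skew(rows)
print(counts.most_common(), skew)
```

A ratio well above 1 suggests that a few prefixes hold most of the data, which is where different split points or a custom balancer might help.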

Can you give us an idea of the scale you are working at (tens, hundreds, or
thousands of nodes)?

Can you describe in more detail what the column family <id> is used for now?

-Eric
On Mon, Jun 17, 2013 at 9:56 PM, Iezzi, Adam [USA] <[EMAIL PROTECTED]> wrote:

> I’ve been asked by my client to store a dataset that contains a time
> series and geospatial coordinates (points) in Accumulo. At the moment, we
> have very dense data stored in Accumulo using the following table schema:
>
> Row ID:     <geohash>_<reverse timestamp>
> Family:     <id>
> Qualifier:  attribute
> Value:      <value>
>
> We are salting our Row IDs with a geohash to prevent hot spotting. When we
> query the data, we use a prefix scan (center tile and eight neighbors),
> then use an Iterator to filter out the outliers (points and time).
> Unfortunately, we’ve noticed some performance issues with this approach:
> the initial prefix scan seems to bring back a large amount of data, forcing
> the iterators to filter out a significant number of outliers. For example,
> more than 50% is being filtered out, which seems inefficient to us.
> Unfortunately for us, our users will always query by space and time, making
> them equally important for each query. Because of the time series component
> of our data, we’re often bringing back a significant amount of data for
> each given point. Each point can have ten entries due to the time series,
> making our data set very dense.
>
> The following are some options we’re considering:
>
> 1. Salt a master table with an ID rather than the geohash
>    (<id>_<reverse timestamp>), and then create a spatial index table. If we
>    choose this option, I assume we would scan the index first, then use a
>    batch scanner with the IDs from the first query. Unfortunately, I still
>    see us filtering out a significant amount of data using this approach.
>
> 2. Keep the table design as is, and maybe add a RegExFilter via a
>    custom Iterator.
>
> 3. Do something completely different, such as using the Column
>    Family and the temporal aspect of the dataset together in some way.
>
> Any advice or guidance would be greatly appreciated.
>
> Thank you,
>
> Adam
>
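
For reference, the row ID layout described in the quoted message can be sketched as below. This is only an illustration under stated assumptions: the reverse timestamp is taken to be `Long.MAX_VALUE - epoch_millis`, zero-padded to a fixed width, and the geohash stored in the row ID is assumed to have exactly the precision of the query tile. The original message states neither, and all function names here are hypothetical. If both assumptions hold, the time window can be expressed as a row range per tile instead of scanning a whole tile prefix and filtering on time afterwards.

```python
MAX_TS = 2**63 - 1   # assumed: Java's Long.MAX_VALUE
WIDTH = 19           # digits in MAX_TS, so padded keys sort lexicographically

def reverse_ts(epoch_millis):
    """Fixed-width reverse timestamp: newer times sort before older ones."""
    return str(MAX_TS - epoch_millis).zfill(WIDTH)

def row_id(geohash, epoch_millis):
    """Build a row ID in the <geohash>_<reverse timestamp> layout."""
    return f"{geohash}_{reverse_ts(epoch_millis)}"

def tile_time_range(geohash, start_millis, end_millis):
    """Inclusive (start_row, end_row) covering one tile and one time window.

    Because timestamps are reversed, the newest time gives the start row.
    """
    return row_id(geohash, end_millis), row_id(geohash, start_millis)

# Hypothetical tile and time window.
lo, hi = tile_time_range("dr5ru", 1000, 2000)
print(lo, hi)
```

One such range per tile (center plus neighbors) could then be handed to a batch scanner. If the stored geohash is longer than the tile prefix, rows within a tile sort by the remaining geohash characters before time, and this trick no longer applies.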