THORMAN, ROBERT D - 2014-07-24, 14:13
William Slacum - 2014-07-24, 17:27
Russ Weeks - 2014-07-24, 17:31
Hunter Provyn - 2014-07-24, 19:06
Sean Busbey - 2014-07-25, 02:14
Anthony F - 2014-07-25, 15:45
Sean Busbey - 2014-07-25, 16:38
Chris Bennight - 2014-07-25, 22:52
Kurt Christensen - 2014-07-26, 12:59
Anthony Fox - 2014-07-27, 20:15
Jared Winick - 2014-07-28, 18:01
Corey Nolet - 2014-07-28, 21:11
Chris Bennight - 2014-07-29, 00:08
Chris Bennight - 2014-07-29, 04:12

- Accumulo
- mail # user
- Z-Curve/Hilbert Curve

Can anyone share a Java method to convert lat/lon (decimal degrees) to Z-Curve (string)? I need to order my geo-spatial data lexicographically.

v/r

Bob Thorman

Principal Big Data Engineer

AT&T Big Data CoE

2900 W. Plano Parkway

Plano, TX 75075

972-658-1714


Quick google search yielded:

https://github.com/GeoLatte/geolatte-geom/blob/master/src/main/java/org/geolatte/geom/curve/MortonCode.java

On Thu, Jul 24, 2014 at 10:10 AM, THORMAN, ROBERT D <[EMAIL PROTECTED]> wrote:


Does it have to be a Z-curve? The excellent Accumulo recipes project has a very easy-to-use geospatial index based on a quadtree.

https://github.com/calrissian/accumulo-recipes/tree/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/support

-Russ

On Thu, Jul 24, 2014 at 10:17 AM, William Slacum <[EMAIL PROTECTED]> wrote:


Bob,

There are a number of open source packages / collections of useful things out there you could use to generate Z-curves for Accumulo:

GeoMesa - http://geomesa.github.io/, https://github.com/locationtech/geomesa/ (Disclaimer: I'm a committer to GeoMesa)

Calrissian - https://github.com/calrissian/accumulo-recipes/tree/master/store/geospatial-store

GeoWave - https://github.com/ngageoint/geowave

GeoMesa currently supports Accumulo 1.4 and 1.5. It has a GeoServer plugin to facilitate analysis.

Hunter

On 07/24/2014 10:10 AM, THORMAN, ROBERT D wrote:


Hi Hunter!

Any plans to update GeoMesa for 1.6?

Could we convince y'all to contribute a Z-curve lexicoder for the core project?

http://accumulo.apache.org/1.6/accumulo_user_manual.html#_lexicoders

Sean

On Jul 24, 2014 2:06 PM, "Hunter Provyn" <[EMAIL PROTECTED]> wrote:


Sean,

GeoMesa plans to update to 1.6 as soon as we get past a 1.0 release (which is waiting on approval by the Eclipse Foundation's IP review process and should happen in a month or so). I think we could break out a z-curve lexicoder and contribute it before then, though. I'll poke around at the lexicoder stuff in 1.6 and see what it would take to use that interface now.

Thanks,

Anthony

On Thu, Jul 24, 2014 at 10:14 PM, Sean Busbey <[EMAIL PROTECTED]> wrote:

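For concreteness, here is a minimal sketch of what such a contribution might look like against the 1.6 Lexicoder interface linked above (a byte[] encode(T) / T decode(byte[]) pair). The class name, the double[]{lat, lon} value type, and the fixed 31-bits-per-axis precision are assumptions made purely for illustration; this is not anything GeoMesa or Accumulo ships today.

    import org.apache.accumulo.core.client.lexicoder.Lexicoder;

    // Hypothetical z-curve lexicoder: encodes {lat, lon} as an 8-byte,
    // big-endian z-value so that byte-wise key order follows the curve.
    public class ZCurveLexicoder implements Lexicoder<double[]> {

      private static final int BITS = 31;            // bits per axis, 62 bits total

      @Override
      public byte[] encode(double[] latLon) {
        long y = quantize(latLon[0], -90, 90);       // latitude
        long x = quantize(latLon[1], -180, 180);     // longitude
        long z = 0;
        for (int i = BITS - 1; i >= 0; i--) {        // interleave, lon bit first
          z = (z << 1) | ((x >>> i) & 1L);
          z = (z << 1) | ((y >>> i) & 1L);
        }
        byte[] b = new byte[8];                      // big-endian: unsigned byte
        for (int i = 7; i >= 0; i--) {               // order == numeric z order
          b[i] = (byte) (z & 0xffL);
          z >>>= 8;
        }
        return b;
      }

      @Override
      public double[] decode(byte[] b) {
        long z = 0;
        for (byte v : b) z = (z << 8) | (v & 0xffL);
        long x = 0, y = 0;
        for (int i = BITS - 1; i >= 0; i--) {        // un-interleave
          x |= ((z >>> (2 * i + 1)) & 1L) << i;
          y |= ((z >>> (2 * i)) & 1L) << i;
        }
        return new double[] {dequantize(y, -90, 90), dequantize(x, -180, 180)};
      }

      private static long quantize(double v, double min, double max) {
        long cells = 1L << BITS;
        return Math.min(cells - 1, (long) ((v - min) / (max - min) * cells));
      }

      private static double dequantize(long q, double min, double max) {
        return min + (q + 0.5) / (double) (1L << BITS) * (max - min);  // cell centre
      }
    }

The fixed-width, big-endian layout is the important part: because the interleaved value is non-negative and always eight bytes, unsigned byte-wise comparison of the encoded keys matches numeric order along the curve.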

On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <[EMAIL PROTECTED]> wrote:

That sounds great Anthony. Let us know if we can help with anything!

Sean


A couple of related issues come up when considering implementing a dimensionality-reducing encoding -- just want to toss those out to see what people think the interface might look like.

There are a couple of aspects that could be brought in here, but let's keep it simple and consider the original question: (lat/lon) -> number.

The more bits we add to the z-curve, the more precise our comparison - i.e. a 63-bit key would have more "locations" to sort by than a 24-bit key. Would you see a reasonable default getting picked, make this user configurable, or both? (i.e. default to a value, extended options with a new constructor?)

I'm extrapolating here, but the only reason I see that locality matters is if we want to preserve locality for range searches. The internal implementation of the encoding/lexicoding process is going to directly impact the implementation of the range query.

Now sure, someone could encode the lower left point, encode the upper right point, and construct a range out of that to pass to a scan, but that's going to be wildly inefficient in most cases. See:

https://dl.dropboxusercontent.com/u/6649380/bbox.png

If we just lexicode the lower left and upper right we traverse across the entire curve - hitting lots of areas that aren't actually in the original range.

Now we can turn a single 2D range into a set of 1D ranges. There is some potential tuning here, as the algorithm has a tradeoff between the time to compute the ranges (and the number of ranges) and "slop" (the inclusion of ranges which aren't actually in the original query).

Would you see a static method, perhaps on the z-curve lexicoder, that returns a series of ranges based on an input window? Some other mechanism? And in the case of "slop" - would we just document that the ranges could actually include values not expected, or would we always fully decompose?

The other, "making it more complicated" questions would probably revolve around generalizing this to a multi-dimensional lexicoder. I would expect the WGS84 (lat/long) encoder would just be a 2D instance with -180/+180, -90/+90 bounds. There are probably completely different cases where it would be useful to have a locality-sensitive hash.

But with the big caveat that this requires more methods - I now need a way of defining a range for each of the dimensions (which before were defined as lat/lon), and I need an interface/class to pass those dimensions to the encoder - and I should be able to use those same definitions to decode my values on the way out.

I'm not trying to make a mountain out of a molehill - one could definitely pick some sane defaults, document behaviour, and put in a WGS84-specific implementation. I'm just thinking about what the minimum viable bit is that's needed for this to be useful - and I suspect it's also the range decomposition piece, as I suspect *everyone* would really need that (if they cared about locality in the first place).

My main question here would be: to what extent would you see this going? A one-off for WGS84, or a more generic multi-dimensional lexicoder? And in either case, what would the thoughts/consensus be for implementation of the additional methods (and possibly classes) required?

On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <[EMAIL PROTECTED]> wrote:

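To make the interface questions above concrete, one possible shape for such an API is sketched below. This is purely illustrative: none of these names exist in Accumulo, and the precision/decomposition policy could just as easily live elsewhere.

    import java.util.List;
    import org.apache.accumulo.core.data.Range;

    // Hypothetical surface for a geo lexicoder plus its range decomposition.
    // Precision would be fixed when an implementation is constructed, and the
    // decomposition lives next to the encoder because both must agree on the
    // exact bit layout of the keys.
    public interface GeoLexicoder {

      // Encode a (lat, lon) pair into a lexicographically sortable row key.
      byte[] encode(double lat, double lon);

      // Inverse of encode, up to quantization error at the configured precision.
      double[] decode(byte[] key);

      // Decompose a bounding box into 1D key ranges. maxRanges caps the work:
      // a small cap returns fewer, sloppier ranges (more false positives to
      // filter out later); a large cap approaches a full decomposition.
      List<Range> toRanges(double minLat, double minLon,
                           double maxLat, double maxLon,
                           int maxRanges);
    }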

The first thing that comes to my mind is bit shuffling. Scale and shift lat and lon into [0,1) ranges, that is, from 0.000... to 0.111... Then, starting just past the decimal point, take the first lon bit (lon first, since it has double the range of lat), then the first lat bit. Then just alternate, second bits, third bits, until you have as much precision as you like. There are discontinuities in the resulting curve, but any point is easy to encode and decode, and it provides a 1-dimensional sort.

I hope that helps.

Kurt

On 7/24/14 10:10 AM, THORMAN, ROBERT D wrote:

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

If you can't explain it simply, you don't understand it well enough. -- Albert Einstein

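Kurt's recipe is essentially what the hypothetical ZCurveLexicoder sketched earlier in the thread does. To get the string row key Bob originally asked for, the encoded bytes just need a fixed-width, zero-padded rendering (hex shown here) so that string order matches numeric order along the curve:

    // Illustrative only: lat/lon -> sortable string, reusing the hypothetical
    // ZCurveLexicoder sketch from earlier in the thread.
    double lat = 33.02, lon = -96.70;                // example point (roughly Plano, TX)
    byte[] z = new ZCurveLexicoder().encode(new double[] {lat, lon});
    StringBuilder row = new StringBuilder();
    for (byte b : z) {
      row.append(String.format("%02x", b));          // zero-padded hex per byte
    }
    String rowKey = row.toString();                  // use as the row portion of the key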

My first thought was just something simple for a first pass - lat/lon -> a single lexicographic dimension - as it would cover most basic use cases. Precision (number of bits in the encoding) could be an arg or a config variable.

For WITHIN/INTERSECTS topological predicates, we need to decompose the query geometry into the (possibly/probably non-contiguous) 1D ranges that cover the region in question. GeoMesa has an algorithm to quickly perform this decomposition that computes the minimum number of geohash prefixes at different resolutions to fully cover the query polygon. Basically, it recursively traverses through levels of geohash resolutions, prioritizing rectangles that intersect the region and discarding non-intersecting rectangles at the lowest precisions, to produce a sequence of ranges to scan over. Fully contained rectangles are discovered at their lowest resolution, at which point the algorithm pops the stack and searches the next prioritized prefix. I think something like this would definitely need to be ported over and included in a lexicoder implementation to make it useful.

Also, rather than materialize the entire set of ranges in memory, we can either return a lazy iterator of prefixes that can be fed into a scanner in batches, or we can have a short-circuit config that tunes the amount of slop that's tolerable and cuts off the traversal at a certain level of precision. GeoMesa uses something like the former on attribute indexes to coordinate parallel scanners on separate index and record tables.

Thoughts? I'm inclined to keep the implementation to the bare minimum necessary for the basic use cases (lat/lon and bbox queries), though I do think a general dimensionality-reducing lexicoder would be very useful.

On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <[EMAIL PROTECTED]> wrote:

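As a rough illustration of that kind of recursive decomposition, here is a much simplified version: a plain depth-limited quadtree walk over the z-curve rather than GeoMesa's prioritized geohash traversal. Class and method names are invented, and it assumes the 31-bits-per-axis, lon-bit-first interleaving from the earlier lexicoder sketch.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified, depth-limited quadtree decomposition of a bounding box into
    // contiguous z-value ranges.
    public final class ZRangeDecomposer {

      private static final int BITS = 31;                    // bits per axis

      private final double minLat, minLon, maxLat, maxLon;   // query window
      private final int maxDepth;                            // recursion cutoff
      private final List<long[]> ranges = new ArrayList<>(); // {lowZ, highZ} pairs

      public ZRangeDecomposer(double minLat, double minLon,
                              double maxLat, double maxLon, int maxDepth) {
        this.minLat = minLat; this.minLon = minLon;
        this.maxLat = maxLat; this.maxLon = maxLon;
        this.maxDepth = maxDepth;
      }

      public List<long[]> decompose() {
        visit(0L, 0, -90, 90, -180, 180);
        return ranges;
      }

      private void visit(long prefix, int depth,
                         double qMinLat, double qMaxLat,
                         double qMinLon, double qMaxLon) {
        // Quad entirely outside the query window: prune.
        if (qMinLat > maxLat || qMaxLat < minLat || qMinLon > maxLon || qMaxLon < minLon) {
          return;
        }
        boolean contained = qMinLat >= minLat && qMaxLat <= maxLat
                         && qMinLon >= minLon && qMaxLon <= maxLon;
        if (contained || depth == maxDepth) {
          // Emit the whole contiguous z-range under this prefix. At maxDepth a
          // partially covered quad is the "slop" discussed earlier in the thread.
          int shift = 2 * (BITS - depth);
          long low = prefix << shift;
          ranges.add(new long[] {low, low + (1L << shift) - 1});
          return;
        }
        double midLat = (qMinLat + qMaxLat) / 2;
        double midLon = (qMinLon + qMaxLon) / 2;
        // Children keyed by (lonBit, latBit), matching the interleaving order.
        visit((prefix << 2) | 0b00, depth + 1, qMinLat, midLat, qMinLon, midLon);
        visit((prefix << 2) | 0b01, depth + 1, midLat, qMaxLat, qMinLon, midLon);
        visit((prefix << 2) | 0b10, depth + 1, qMinLat, midLat, midLon, qMaxLon);
        visit((prefix << 2) | 0b11, depth + 1, midLat, qMaxLat, midLon, qMaxLon);
      }
    }

Each {low, high} pair could then be turned into an Accumulo Range by running both endpoints through the same byte encoding the lexicoder uses; the maxDepth parameter is the knob being discussed here for trading range count against slop.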

As several people have commented, a single range for a query can produce a lot of false positives that need to be filtered out. I had made this visualization a while back that lets you interactively (click-drag a bounding box) see this behavior.

http://bl.ocks.org/jaredwinick/5073432

On Sun, Jul 27, 2014 at 2:15 PM, Anthony Fox <[EMAIL PROTECTED]> wrote:


On Calrissian, we had once considered making a lexicoder for lat/long that transformed the two dimensions (lat/lon) down into a geohash based on the z-curve.

The reason we decided against a first-class data type for this is exactly the same reason that Anthony brings up in his previous comment: the geohash only makes sense as a query tool when you have a good way to structure (as many as possible of) the non-sequential ranges you would need in order to find all the overlapping regions, with as few false positives as possible.

On Mon, Jul 28, 2014 at 2:00 PM, Jared Winick <[EMAIL PROTECTED]> wrote:


Anthony -

So that sounds reasonable - it's sounding like the consensus is a basic WGS84 (lat/long) type (point only), a range query (lower left, upper right), and a decompose-fully method (i.e. only intersecting ranges). (Caveat: we probably want to cap the precision, as a full decomposition of, say, a 31/31-bit z-curve might take a long time.)

What's the performance of your query like for a 24-bit/24-bit (x,y) z-curve on a full decomposition over a significant portion of the range? I'm thinking, in the interest of keeping it simple/easy, it might be best just to fix the precision and document it. 24 bits at the equator would still give us a resolution of ~2m (that should be the worst case), which should be good enough for most things (basically 0.00002 degrees would be the max granularity for longitude).

If it's fast enough I would just cap at 64 (well, maybe 63 for signed issues) total bits to keep the key as a long - 31 bits/31 bits gives a longitudinal granularity of ~0.00000017 degrees, or 1.9cm at the equator. I think a full decomposition on a 31/31-bit grid might be expensive.

(The values above assume we normalize the input values - which I think is a given for this implementation?)

If we can fix the precision then that takes out most of the variables and it's pretty straightforward - I think decomposition performance is the main limiting factor.

Jared -

Yep, exactly that - which would be unintuitive behaviour (getting back ranges that didn't really match). That said, an inclusive-only range set can be generated; you can model it as a quadtree decomposition and proceed recursively if your current region ("quad") isn't fully covered by your selection window - otherwise return the range of your current "quad". Which, I think, is essentially what Anthony described in his implementation (though with stack collection instead of recursion; probably for the best there).

Corey -

So you think the lexicoder would make sense/be reasonable if there were a performant range query tool to go with it (one that resulted in 0 false positives)?

On Mon, Jul 28, 2014 at 5:10 PM, Corey Nolet <[EMAIL PROTECTED]> wrote:

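As a quick sanity check on those granularity figures (using the rough equatorial approximation of 111.32 km per degree of longitude):

    360 / 2^24 ≈ 2.15e-5 degrees per cell ≈ 2.4 m at the equator
    360 / 2^31 ≈ 1.68e-7 degrees per cell ≈ 1.9 cm at the equator

so the ~2 m and ~1.9 cm numbers above line up with a 24-bit and a 31-bit longitude axis respectively.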

So I put together a quick mockup @ https://github.com/chrisbennight/wgs84lexicoder of what this might work like. It's functional, but I haven't done more than a cursory test, so caveat emptor.

On the performance side, a full decomposition (no slop) (on a Hilbert curve) crossed the 1 second mark between 18 and 19 bits of precision. This is on an i7-920. At 18 bits the average error (where error is the deviation after an encode/decode cycle due to quantization effects) measured for longitude was 0.0006639 degrees, or roughly 73 meters at the equator. This should be the worst error you will see; latitude was on average ~1/2 the error in degrees (since it had a smaller range for the same number of bits - -90 to +90 vs. -180 to +180).

Anyway, this was just a simplistic stab to ground the discussion. The performance of the full decomposition was worse than I expected, so that probably places some constraints on what the defaults should be. Here's the output of the test I wrote:

bits  time (s)  ranges    actual lat error (deg)  worst-case avg lon error (deg)
2     0.030     1         30.7720588              60.3973510
3     0.003     3         12.8098739              25.8845790
4     0.002     6         5.3602941               10.8079470
5     0.003     13        2.8071632               5.6910916
6     0.012     46        1.3392857               2.7625355
7     0.025     73        0.6852131               1.3516191
8     0.074     313       0.3153114               0.6264122
9     0.037     419       0.1702976               0.3452521
10    0.040     910       0.0850656               0.1724573
11    0.040     1906      0.0438051               0.0861865
12    0.035     6335      0.0144635               0.0343498
13    0.034     7957      0.0090083               0.0215387
14    0.070     18518     0.0056349               0.0117874
15    0.132     33670     0.0027366               0.0055297
16    0.333     83506     0.0012269               0.0024738
17    0.651     167546    0.0006639               0.0013460
18    0.773     198597    0.0003219               0.0006639
19    1.329     335470    0.0001660               0.0003274
20    5.268     1049209   0.0000767               0.0001523
21    6.735     1433559   0.0000415               0.0000841
22    29.073    5371949   0.0000207               0.0000421
23    44.222    7681223   0.0000107               0.0000210
24    72.172    11490434  0.0000035               0.0000084

On Mon, Jul 28, 2014 at 7:59 PM, Chris Bennight <[EMAIL PROTECTED]> wrote:

