Tom, another approach would be to store an ASCII-folded version of the string (accents stripped, so "è" becomes "e") as the row key or column qualifier, and keep the full UTF-8 string elsewhere (e.g. in the cell value, or later in the row key). That won't settle the fine sorting (whether "è" sorts before or after "e"), but it does solve the gross sorting ("è" would always come before "f"). If you need true UTF-8 collation in the results, you could implement it as a layer on top of that (in your app, or maybe in a coprocessor; I'm not sure about the latter). At least with this approach you can still take advantage of row key ranges in your scans, which would probably make up for any time spent doing a secondary sort.
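
To make the idea concrete, here is a rough sketch in Python (chosen only for brevity). Nothing in it is HBase API: the helper names ascii_fold and make_row_key are made up for illustration, and the layout (folded ASCII, a 0x00 separator, then the original UTF-8 bytes) is just one assumed way of putting the full string "later in the row key". It also assumes your strings never contain a NUL character, and it simply drops characters that have no ASCII decomposition (e.g. "ø"), which you would probably want to handle with a mapping table.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip accents: decompose (NFKD), drop combining marks, keep ASCII only."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is dropped here (assumption: acceptable loss).
    return no_marks.encode("ascii", "ignore").decode("ascii")


def make_row_key(text: str) -> bytes:
    """Hypothetical key layout: folded ASCII + 0x00 separator + original UTF-8."""
    return ascii_fold(text).encode("ascii") + b"\x00" + text.encode("utf-8")


words = ["f", "è", "e", "ez"]

# Plain UTF-8 byte order: "è" (0xC3 0xA8) lands after "f" (0x66).
raw_order = sorted(words, key=lambda s: s.encode("utf-8"))

# Folded-key byte order: "è" now sorts next to "e" and before "f".
folded_order = sorted(words, key=make_row_key)
```

With these four strings, raw_order comes out as e, ez, f, è, while folded_order comes out as e, è, ez, f. Whether "è" lands before or after plain "e" within the same folded prefix is still decided by raw bytes, so that is the part a client-side collation pass would have to fix up.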
On Jun 8, 2012, at 12:34 PM, Tom Brown wrote:
> Storing the bytes as native UTF-16 or UTF-32 will not help. Even
> strings in UTF-8 format can be sorted by their code points when stored
> as bytes. Unfortunately, that's not really useful for collation as
> characters like "è" (U+00E8) should appear between "e" (U+0065) and
> "f" (U+0066), but the code points do not allow this.
> Thanks anyway!
> On Fri, Jun 8, 2012 at 11:14 AM, Stack <[EMAIL PROTECTED]> wrote:
>> On Fri, Jun 8, 2012 at 9:35 AM, Tom Brown <[EMAIL PROTECTED]> wrote:
>>> Is there any way to introduce a different ordering scheme from
>>> the base comparable bytes? My use case is that I am using UTF-8 data
>>> for my keys, and I would like to have scans use UTF-8 collation.
>>> Could this be done by providing an alternate implementation of
>>> Thanks in advance!
>> Unfortunately no, Tom. The database is all sorted the same way.
>> Different sorts per table would complicate system interactions (the
>> catalog tables would have to change sort by table). It might be
>> doable but it would take some work.
>> Can you store your data as UTF-16 or UTF-32? It's been a while since I dealt
>> w/ this stuff but IIRC, their sort order is byte order? (WARNING! I
>> could be way off here).