Accumulo, mail # dev - Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
David Medinets 2012-10-29, 20:29
Any time I've encountered non-English character sets, the answer
has been to use UTF-8. I'm moving forward with that assumption since
it is a safe change. If the group decides to use a different default
encoding, it will be trivial to build on the work that I've done
identifying getBytes() calls. I will post a list of files and my
methodology before an svn checkin.
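
For illustration, the change being described would look roughly like the sketch below: each bare getBytes() call, which uses the platform default charset, gets an explicit charset argument instead. The class and method names are hypothetical and not taken from the Accumulo source, and the sketch assumes Java 7 or later for StandardCharsets (on older JVMs, Charset.forName("UTF-8") would serve the same purpose).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Accumulo.
class ExplicitCharsetExample {

    // Encode with an explicit charset rather than the platform default,
    // so the resulting bytes are the same on every JVM.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String row = "caf\u00e9";            // "café": three ASCII chars plus one accented char
        byte[] encoded = toUtf8(row);
        System.out.println(encoded.length);  // prints 5 (the accented char takes two bytes)
        System.out.println(fromUtf8(encoded).equals(row)); // prints true
    }
}
```

This only covers the case where the input really is text; it says nothing about byte arrays that were never characters in the first place.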

On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <[EMAIL PROTECTED]> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
>> reason to use something beside ISO-8859-1 for the encodings?
>
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide additional API.
>
>
>
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <[EMAIL PROTECTED]> wrote:
>>
>>> > UTF-8 should always be present (according to the JLS), and as a
>>> > multi-byte format should be able to encode any character that you
>>> > would need to.
>>> >
>>>
>>> UTF-8 cannot encode arbitrary data. Not all data that we store in Accumulo
>>> is characters. A safe encoding to use as a pass-through when you
>>> don't know if you are dealing with characters is ISO-8859-1, since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
>>>
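
The round-trip claim in the quoted message above can be checked with a small sketch. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later. It decodes every possible byte value to a String and encodes it back, once with ISO-8859-1 and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class; not part of Accumulo.
class RoundTripExample {

    // True if bytes -> String -> bytes with the given charset
    // reproduces the original array exactly.
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        String asText = new String(data, cs);
        return Arrays.equals(data, asText.getBytes(cs));
    }

    // An array holding every byte value 0x00..0xFF once, standing in
    // for arbitrary binary data that is not text in any encoding.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = allByteValues();
        // ISO-8859-1 maps each byte to exactly one char, so nothing is lost.
        System.out.println(survivesRoundTrip(data, StandardCharsets.ISO_8859_1)); // prints true
        // Bytes that are not valid UTF-8 are decoded as U+FFFD, so the
        // original values cannot be recovered.
        System.out.println(survivesRoundTrip(data, StandardCharsets.UTF_8)); // prints false
    }
}
```

The sketch illustrates why the two positions in the thread differ: UTF-8 is a sound choice for data known to be text, while a pass-through of unknown bytes needs either a one-to-one charset such as ISO-8859-1 or an API that never leaves byte[].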