Accumulo, mail # dev - Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
David Medinets 2012-10-29, 20:29
Whenever I've encountered non-English character sets, the answer
has been to use UTF-8. I'm moving forward with that assumption since
it is a safe change. If the group decides to use a different default
encoding, it will be trivial to build on the work that I've done
identifying getBytes() calls. I will post a list of files and my
methodology before an svn checkin.
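
As a rough illustration of the kind of change being discussed, here is a minimal sketch of replacing a bare getBytes() call with one that names the charset. The class and method names are hypothetical and are not taken from the Accumulo source. It assumes Java 7 or later for StandardCharsets; on older JVMs, Charset.forName("UTF-8") would be the equivalent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Accumulo: shows an explicit charset
// on both the encode and decode side instead of the platform default.
public class ExplicitCharset {

    // Before: s.getBytes() uses the JVM's default charset, which can
    // differ between machines. After: the charset is stated explicitly.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that was used to encode.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo"; // contains one non-ASCII character
        byte[] encoded = encode(original);
        System.out.println("bytes: " + encoded.length);
        System.out.println("round trip ok: " + original.equals(decode(encoded)));
    }
}
```

This only covers data that really is text. Whether UTF-8 is appropriate for values that are not character data is the question raised in the quoted replies.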

On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <[EMAIL PROTECTED]> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
>> reason to use something beside ISO-8859-1 for the encodings?
>
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide an additional API.
>
>
>
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <[EMAIL PROTECTED]> wrote:
>>
>>> > UTF-8 should always be present (according to the JLS), and as a multi-byte
>>> > format should be able to encode any character that you would need to.
>>> >
>>>
>>> UTF-8 cannot encode arbitrary data. Not all of the data that we store in
>>> Accumulo is characters. A safe encoding to use as a pass-through when you
>>> don't know if you are dealing with characters is ISO-8859-1, since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
>>>
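
The round-trip concern in the quoted replies can be checked with a small, self-contained sketch. The class name and the sample bytes are hypothetical and chosen only for illustration. The bytes are deliberately not valid UTF-8, to stand in for arbitrary binary data stored in a table.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo, not Accumulo code: decodes raw bytes to a String
// and encodes them back, to see whether the original bytes survive.
public class CharsetRoundTrip {

    // bytes -> String -> bytes using a single charset for both steps.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    // True when the round trip returns exactly the original bytes.
    static boolean survives(byte[] data, Charset charset) {
        return Arrays.equals(data, roundTrip(data, charset));
    }

    public static void main(String[] args) {
        // 0xC3 followed by 0x28, and a lone 0xFF, are malformed in UTF-8.
        byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00, 0x41};

        // ISO-8859-1 maps every byte value to exactly one character.
        System.out.println("ISO-8859-1 lossless: "
                + survives(raw, StandardCharsets.ISO_8859_1));

        // The String constructor replaces malformed UTF-8 input with
        // U+FFFD, so the original bytes are not recovered.
        System.out.println("UTF-8 lossless: "
                + survives(raw, StandardCharsets.UTF_8));
    }
}
```

Neither result argues for pushing binary data through String at all. An API expressed purely in bytes, as suggested in the thread, avoids the question entirely.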