-
Setting Charset in getBytes() call. David Medinets 2012-10-28, 21:50
https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
In this comment, John mentioned that all getBytes() method calls should be changed to use UTF8. There are about 1,800 getBytes() calls and not all of them involve String objects. I am working on ways to identify a subset of these calls to change.

I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to track this issue.

Should we create one static Charset object?

    class AccumuloDefaultCharset {
      public static Charset UTF8 = Charset.forName("UTF8");
    }

Should we use a static constant?

    public static String UTF8 = "UTF8";

I have found one instance of getBytes() in InputFormatBase:

    protected static byte[] getPassword(Configuration conf) {
      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
    }

Are there any reasons why I can't start specifying the charset? Is UTF8 the right Charset to use? I am not an expert in non-English charsets, so guidance would be welcome.
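For comparison, here is a minimal sketch of the two options asked about above. The class and method names (`CharsetOptions`, `encodeWithCharset`, `encodeWithName`) are hypothetical and not taken from Accumulo. It assumes Java 6 or later, where `String.getBytes(Charset)` is available.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical holder class, named for illustration only.
final class CharsetOptions {
  // Option 1: a shared Charset object.
  static final Charset UTF8 = Charset.forName("UTF-8");
  // Option 2: a shared charset name.
  static final String UTF8_NAME = "UTF-8";

  static byte[] encodeWithCharset(String s) {
    // getBytes(Charset) declares no checked exception.
    return s.getBytes(UTF8);
  }

  static byte[] encodeWithName(String s) {
    try {
      // getBytes(String) forces callers to handle a checked exception.
      return s.getBytes(UTF8_NAME);
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a required charset on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    byte[] a = encodeWithCharset("p\u00e4ssword");
    byte[] b = encodeWithName("p\u00e4ssword");
    System.out.println("same bytes: " + Arrays.equals(a, b));
  }
}
```

Both options produce the same bytes. The practical difference in this sketch is that the String-name variant needs a try/catch at every call site, while the Charset variant does not.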
-
Re: Setting Charset in getBytes() call. Ed Kohlwey 2012-10-28, 22:18
If you use a private static field in each class for the charset, it will basically be a singleton because charsets are cached in Charset.forName. IMHO this is a somewhat cleaner approach than having lots of static imports to utility classes with lots of constants in them.
-
Re: Setting Charset in getBytes() call. William Slacum 2012-10-29, 15:39
Isn't it easier to just set the JVM property `file.encoding`?
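To make the JVM-property approach concrete, the sketch below shows that the no-argument `getBytes()` follows the JVM default charset. The class name `DefaultCharsetDemo` is hypothetical. How the default is derived from `file.encoding` varies between JVM vendors and versions, so treat the printed values as environment-dependent.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
final class DefaultCharsetDemo {
  // True when getBytes() matches an explicit call with the JVM default charset.
  static boolean noArgUsesDefault(String s) {
    return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
  }

  public static void main(String[] args) {
    // Values depend on the platform and on any -Dfile.encoding=... option.
    System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
    System.out.println("default charset = " + Charset.defaultCharset());
    System.out.println("matches default = " + noArgUsesDefault("accumulo"));
  }
}
```

Running it with a different default, for example `java -Dfile.encoding=ISO-8859-1 DefaultCharsetDemo`, is one way to see whether a particular JVM honors the property.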
-
Re: Setting Charset in getBytes() call. David Medinets 2012-10-29, 16:00
I like the idea of making the change explicit in the source code. Setting the encoding in the JVM property would be easier but not as explicit. I have changed a few dozen of the files. Today I have free time since Hurricane Sandy has closed offices.
-
Re: Setting Charset in getBytes() call. Josh Elser 2012-10-29, 16:21
David, I beg to differ.

Setting it via the JVM property is a single change to make, whereas if you change every single usage of getBytes(), you have now forced the next person to branch the code, change everything to UTF16 (hypothetical use case) and continue a diverged codebase forever.

I would say that the reason that such a JVM property exists is to relieve you of having to make these code changes in the first place.
-
Re: Setting Charset in getBytes() call. Benson Margulies 2012-10-29, 16:24
On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> David, I beg to differ.
>
> Setting it via the JVM property is a single change to make, whereas if you
> change every single usage of getBytes(), you now forced the next person to
> branch the code, change everything to UTF16 (hypothetical use case) and
> continue a diverged codebase forever.

Typically, the reason(s) that people don't take this approach are:

a: a fear that other JVMs don't have this parameter, or don't have it under the same name.
b: a desire to read or write files for uses in 'the platform encoding' whatever it is, in addition to whatever needs to be done in UTF-8.

I'd be very surprised if Accumulo ever decided to do this sort of thing in UTF-16.
-
Re: Setting Charset in getBytes() call. John Vines 2012-10-29, 16:42
Are there any experts when it comes to character encodings? First of all, I would like to make sure there are no sacrifices being made by forcing UTF-8.

From there, I think JVM properties are the way to go. Should there be ANY sort of shortfall with UTF-8, we should allow users to switch the encoding to the type of their pleasure. We can tweak the scripts to set the JVM property but still allow users to override should they need it in their setup. This allows us to not only avoid a massive code change, it also makes it easier for users to switch to an encoding should they have a need to.

John
-
Re: Setting Charset in getBytes() call. Josh Elser 2012-10-29, 16:57
Benson, perhaps "contrived" would have been better than "hypothetical" :). That being said, I also hadn't thought about other JVM implementations.

I wonder if leaving a commented note in the accumulo-env.sh script for alternative namings for the "file.encoding" name and the JVM it applies to would be sufficient?

David, can you give some sort of feel for the usages of the getBytes() calls? Since most of the API deals with things in terms of Text and byte[] (Key and Value decomposed), are most of the usages configuration/user-input based as your initial snippet from InputFormatBase showed?
-
Re: Setting Charset in getBytes() call. David Medinets 2012-10-29, 17:00
> David, can you give some sort of feel for the usages of the getBytes()
> calls? Since most of the API deals with things in terms of Text and byte[]
> (Key and Value decomposed), are most of the usages configuration/user-input
> based as your initial snippet from InputFormatBase showed?

I will post a list of the files that I have changed before I commit. I will post the file list as a response in this thread.
Re: Setting Charset in getBytes() call. William Slacum 2012-10-29, 17:13
Since this only affects Strings, I'm even more inclined to leave the option at the JVM. Most of our methods that accept a `CharSequence` or `String` object end up creating a `Text` object based off them, which encodes them with UTF-8. I'd much rather make it our convention to always convert `String` to `Text` objects if we need to deal with them in a textual way; otherwise we're just dealing with `byte[]` when serializing keys and values.

Now, it's another story if Thrift is serializing `String`s with the JVM setting...
Re: Setting Charset in getBytes() call. Mike Drob 2012-10-29, 17:16
One specific use case is when creating a new connection, the password is passed as a byte[], when I expect most sane applications will treat it as a String (either via reading it from a file, or reading it from a terminal input). If somebody creates the password with a different platform encoding than what the programmer expects, then it will cause a lockout that is very difficult to debug.

On topic to the original question, if anybody is brave enough to use Java 7, then there are predefined constants in the JDK - http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html

UTF-8 should always be present (according to the JLS), and as a multi-byte format should be able to encode any character that you would need to.

I've had this conversation with Keith before, so hopefully he can weigh in on this.

Mike
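The lockout scenario above can be sketched as follows: the same password String yields different bytes under different encodings as soon as it contains a non-ASCII character. The class and method names are hypothetical, and the example assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
final class PasswordEncodingDemo {
  // True when the password encodes to identical bytes in UTF-8 and ISO-8859-1.
  static boolean sameBytes(String password) {
    byte[] utf8 = password.getBytes(StandardCharsets.UTF_8);
    byte[] latin1 = password.getBytes(StandardCharsets.ISO_8859_1);
    return Arrays.equals(utf8, latin1);
  }

  public static void main(String[] args) {
    // Pure ASCII passwords are unaffected by the choice between these two charsets.
    System.out.println("ascii password matches:  " + sameBytes("secret"));
    // A single accented character is enough to make the byte arrays differ.
    System.out.println("accented password matches: " + sameBytes("s\u00e9cret"));
  }
}
```

A byte-for-byte password comparison would therefore fail if the password was created under one encoding and later supplied under the other.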
-
Re: Setting Charset in getBytes() call. Michael Flester 2012-10-29, 19:14
> UTF-8 should always be present (according to the JLS), and as a multi-byte
> format should be able to encode any character that you would need to.

UTF-8 cannot encode arbitrary data. Not all data that we store in Accumulo is characters. A safe encoding to use as a pass-through when you don't know if you are dealing with characters is ISO-8859-1 since we know that we can make the round trip from bytes to string to bytes without loss.
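The round-trip claim above can be checked with a short sketch that pushes every possible byte value through bytes to String to bytes. The class and method names are hypothetical, and the example assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
final class ByteRoundTrip {
  // Decode the bytes with the given charset, then encode the result again.
  static byte[] roundTrip(byte[] data, Charset cs) {
    return new String(data, cs).getBytes(cs);
  }

  // One byte for each of the 256 possible values.
  static byte[] allByteValues() {
    byte[] all = new byte[256];
    for (int i = 0; i < all.length; i++) {
      all[i] = (byte) i;
    }
    return all;
  }

  public static void main(String[] args) {
    byte[] all = allByteValues();
    System.out.println("ISO-8859-1 lossless: "
        + Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1)));
    System.out.println("UTF-8 lossless:      "
        + Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));
  }
}
```

ISO-8859-1 maps each byte to exactly one character, so nothing is lost. Under UTF-8, byte sequences that are not valid UTF-8 are replaced during decoding, so the original bytes cannot be recovered.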
Re: Setting Charset in getBytes() call. John Vines 2012-10-29, 19:18
So perhaps we should have ISO-8859-1 as the standard. Mike, do you see any reason to use something besides ISO-8859-1 for the encodings?

John
-
Re: Setting Charset in getBytes() call. Benson Margulies 2012-10-29, 20:02
On Mon, Oct 29, 2012 at 3:18 PM, John Vines <[EMAIL PROTECTED]> wrote:
> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
> reason to use something beside ISO-8859-1 for the encodings?

I object and caution against *any* plan that involves transcoding from X to UTF-16 and back when the data is not always going to be valid bytes of encoding X. The only clean solution here is to have an API entirely in terms of bytes, and either let the user do getBytes if they want to store string data, or provide additional API.
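One way to picture the byte-oriented design suggested above is the sketch below. `ByteValueStore` and all of its methods are hypothetical and do not correspond to any Accumulo class. The core methods deal only in `byte[]`, and the String helpers are a thin, explicitly UTF-8 layer on top (assuming Java 7 or later for `StandardCharsets`).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical store, named for illustration only.
final class ByteValueStore {
  private final List<byte[]> values = new ArrayList<byte[]>();

  // Core API: bytes in, bytes out, no transcoding at any point.
  void put(byte[] value) {
    values.add(value.clone());
  }

  byte[] get(int index) {
    return values.get(index).clone();
  }

  // Additional API: the charset is fixed and visible, never the platform default.
  void putString(String value) {
    put(value.getBytes(StandardCharsets.UTF_8));
  }

  String getString(int index) {
    return new String(get(index), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    ByteValueStore store = new ByteValueStore();
    store.put(new byte[] {(byte) 0xFF, 0x00}); // arbitrary, non-text data
    store.putString("h\u00e9llo");             // text data
    System.out.println(store.get(0).length + " raw bytes, text = " + store.getString(1));
  }
}
```

In this sketch, `getString` is only meaningful for values that were stored as text. Arbitrary bytes stay on the `byte[]` path and are never turned into a String.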
-
Re: Setting Charset in getBytes() call. David Medinets 2012-10-29, 20:29
Anytime that I've encountered non-English character sets, the answer has been to use UTF-8. I'm moving forward with that assumption since it is a safe change. If the group decides to use a different default encoding, it will be trivial to build on the work that I've done identifying getBytes() calls. I will post a list of files and my methodology before an svn checkin.
-
Re: Setting Charset in getBytes() call.Michael Flester 2012-10-30, 00:27
I agree with Benson entirely with one caveat. It seems to me that there
might be two categories of things being discussed 1. User data (keys and values) 2. Ancillary things needed for operation of Accumulo (passwords). These could well be considered separately. Trying to do anything with keys and values other than treating them as bytes all of the time I find quite scary. And if this is only being done to satisfy pmd or findbugs, those tools can be convinced to modify their reporting about this issue. +
-
Re: Setting Charset in getBytes() call.Josh Elser 2012-10-30, 00:46
+1 Mike.
1. It would be hard for me to believe Key/Value are ever handled internally in terms of Strings, but, if such a case does exist, it would be extremely prudent to fix. 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by other commands [1,2]. It would be good to double check all of the other commands. [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java On 10/29/2012 8:27 PM, Michael Flester wrote: > I agree with Benson entirely with one caveat. It seems to me that there > might be two categories of things being discussed > > 1. User data (keys and values) > 2. Ancillary things needed for operation of Accumulo (passwords). > > These could well be considered separately. Trying to do anything with > keys and values other than treating them as bytes all of the time > I find quite scary. > > And if this is only being done to satisfy pmd or findbugs, those tools > can be convinced to modify their reporting about this issue. > +
-
Re: Setting Charset in getBytes() call.Benson Margulies 2012-10-30, 00:54
On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> +1 Mike.
>
> 1. It would be hard for me to believe Key/Value are ever handled internally
> in terms of Strings, but, if such a case does exist, it would be extremely
> prudent to fix.
>
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by
> other commands [1,2]. It would be good to double check all of the other
> commands.

I'm a bit lost. Any possible Java String can be rendered in UTF-8. So, if you are calling String.getBytes to turn a string into some bytes for some purpose, I think you need UTF-8.

On the other hand, as Mike pointed out, new String(somebytes, "utf-8") will destroy data for some byte values that are not, in fact, UTF-8. But why would Accumulo ever need to string-ify some array of bytes of uncertain parentage?

>
> [1]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> [2]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>
>
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>
>> 1. User data (keys and values)
>> 2. Ancillary things needed for operation of Accumulo (passwords).
>>
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>>
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.
>>
>
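The two halves of this point can be checked with a small sketch. It is not taken from the thread: the class and method names are hypothetical, and it assumes Java 7 or later for StandardCharsets. It pushes bytes that are not valid UTF-8 through a String and back, once with ISO-8859-1 and once with UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // bytes -> String -> bytes using ISO-8859-1, which maps every byte value to one char.
    static byte[] viaLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1);
    }

    // bytes -> String -> bytes using UTF-8; malformed input is replaced with U+FFFD on decode.
    static byte[] viaUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC3 followed by 0x28 is a broken two-byte sequence, and 0xFF never occurs in UTF-8.
        byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF};
        System.out.println("ISO-8859-1 round trip intact: " + Arrays.equals(raw, viaLatin1(raw))); // true
        System.out.println("UTF-8 round trip intact: " + Arrays.equals(raw, viaUtf8(raw)));       // false
    }
}
```

The ISO-8859-1 path only preserves the bytes; it says nothing about what characters they were meant to be, which is why keeping such data as byte[] throughout avoids the question.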
-
Re: Setting Charset in getBytes() call.Josh Elser 2012-10-30, 01:57
I'm saying that I don't know of anything in the core API which performs
a getBytes() on the data itself. Accumulo itself is agnostic dealing only in byte[]. I think we're saying the same thing.. On 10/29/2012 8:54 PM, Benson Margulies wrote: > On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <[EMAIL PROTECTED]> wrote: >> +1 Mike. >> >> 1. It would be hard for me to believe Key/Value are ever handled internally >> in terms of Strings, but, if such a case does exist, it would be extremely >> prudent to fix. >> >> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by >> other commands [1,2]. It would be good to double check all of the other >> commands. > > I'm a bit lost. Any possible Java String can be rendered in UTF-8. So, > if you are calling String.getBytes to turn a string into some bytes > for some purpose, I think you need UTF-8. > > On the other hand, as Mike pointed out, new String(somebytes, "utf-8") > will destroy data for some byte values that are not, in fact, UTF-8. > By why would Accumulo ever need to string-ify some array of bytes of > uncertain parentage? > > >> >> [1] >> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java >> [2] >> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java >> >> >> On 10/29/2012 8:27 PM, Michael Flester wrote: >>> >>> I agree with Benson entirely with one caveat. It seems to me that there >>> might be two categories of things being discussed >>> >>> 1. User data (keys and values) >>> 2. Ancillary things needed for operation of Accumulo (passwords). >>> >>> These could well be considered separately. Trying to do anything with >>> keys and values other than treating them as bytes all of the time >>> I find quite scary. >>> >>> And if this is only being done to satisfy pmd or findbugs, those tools >>> can be convinced to modify their reporting about this issue. >>> >> +
-
Re: Setting Charset in getBytes() call.John Vines 2012-10-30, 02:08
We also need to be concerned about any string convenience classes using an
encoding scheme that still has some logical sorting (if that's an issue). Sent from my phone, pardon the typos and brevity. On Oct 29, 2012 9:57 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote: > I'm saying that I don't know of anything in the core API which performs a > getBytes() on the data itself. Accumulo itself is agnostic dealing only in > byte[]. I think we're saying the same thing.. > > On 10/29/2012 8:54 PM, Benson Margulies wrote: > >> On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <[EMAIL PROTECTED]> wrote: >> >>> +1 Mike. >>> >>> 1. It would be hard for me to believe Key/Value are ever handled >>> internally >>> in terms of Strings, but, if such a case does exist, it would be >>> extremely >>> prudent to fix. >>> >>> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is >>> referenced by >>> other commands [1,2]. It would be good to double check all of the other >>> commands. >>> >> >> I'm a bit lost. Any possible Java String can be rendered in UTF-8. So, >> if you are calling String.getBytes to turn a string into some bytes >> for some purpose, I think you need UTF-8. >> >> On the other hand, as Mike pointed out, new String(somebytes, "utf-8") >> will destroy data for some byte values that are not, in fact, UTF-8. >> By why would Accumulo ever need to string-ify some array of bytes of >> uncertain parentage? 
>>
>>> [1]
>>> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
>>> [2]
>>> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>>>
>>> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>>
>>>> I agree with Benson entirely with one caveat. It seems to me that there
>>>> might be two categories of things being discussed
>>>>
>>>> 1. User data (keys and values)
>>>> 2. Ancillary things needed for operation of Accumulo (passwords).
>>>>
>>>> These could well be considered separately. Trying to do anything with
>>>> keys and values other than treating them as bytes all of the time
>>>> I find quite scary.
>>>>
>>>> And if this is only being done to satisfy pmd or findbugs, those tools
>>>> can be convinced to modify their reporting about this issue.
>>>>
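The sorting concern can be made concrete with a sketch. It assumes that encoded keys are ordered by comparing bytes as unsigned values; compareUnsigned below is a hypothetical helper written for this illustration, not an Accumulo API. The example compares one character from the top of the Basic Multilingual Plane with one supplementary character, first as UTF-8 bytes and then with String.compareTo, which compares UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

class SortOrderDemo {
    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = new String(Character.toChars(0xFF5E));            // one UTF-16 unit
        String supplementary = new String(Character.toChars(0x1F600)); // surrogate pair

        int byUtf8 = compareUnsigned(bmp.getBytes(StandardCharsets.UTF_8),
                                     supplementary.getBytes(StandardCharsets.UTF_8));
        int byString = bmp.compareTo(supplementary);

        // UTF-8 byte order follows code point order: U+FF5E sorts before U+1F600.
        System.out.println("UTF-8 bytes, bmp first: " + (byUtf8 < 0));        // true
        // UTF-16 code unit order disagrees: 0xFF5E is greater than the surrogate 0xD83D.
        System.out.println("String.compareTo, bmp first: " + (byString < 0)); // false
    }
}
```

So a String convenience method that encodes with UTF-8 keeps code point order in the stored bytes, but that order is not always the one Java's own String comparison would give.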
-
Re: Setting Charset in getBytes() call.David Medinets 2012-10-30, 02:47
I've looked at every getBytes() call. I have changed 82 files to use
getBytes(utf8). Each file uses the following Charset declaration:

    private static final Charset utf8 = Charset.forName("UTF8");

If at some future time the Charset should be changed or another approach is decided upon, simply search for that string and refactoring will be straightforward.

The attached file shows how I performed the search for getBytes() and has a list of files that continue to have getBytes() because it is called on a Text or some other kind of object. The code just prints a list of files using getBytes(). Then I manually reviewed the files, made changes or added the file name to the ignore list.

The code compiles and the tests run. I don't see any reason why I should not commit my work to the trunk for v1.5.0. I don't want to cause disharmony but I can't see the harm. And even if my change causes some problem, wouldn't it be better to know that while v1.5.0 is still being actively developed?
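One possible variation on the per-file declaration is a single shared constant with matching encode and decode helpers, so that bytes written as UTF-8 are also read back as UTF-8. This is only a sketch: InternalCharset is a hypothetical name, and StandardCharsets requires Java 7 or later (on Java 6, Charset.forName("UTF-8") would serve the same purpose; "UTF8" is accepted by the JDK as an alias of the canonical name "UTF-8").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class InternalCharset {
    // Single place to change if a different charset is ever agreed upon.
    static final Charset UTF8 = StandardCharsets.UTF_8;

    private InternalCharset() {
        // constants and static helpers only
    }

    // String -> bytes with the shared charset.
    static byte[] encode(String s) {
        return s.getBytes(UTF8);
    }

    // bytes -> String with the same charset, so the round trip is symmetric.
    static String decode(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo";
        byte[] encoded = encode(original);
        System.out.println(encoded.length);                    // 6: the accented letter takes two bytes
        System.out.println(original.equals(decode(encoded)));  // true
    }
}
```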
-
Re: Setting Charset in getBytes() call.Josh Elser 2012-10-30, 22:27
On 10/29/2012 10:47 PM, David Medinets wrote:
>
> The code compiles and the tests run. I don't see any reason why I
> should not commit my work to the trunk for v1.5.0. I don't want to
> cause disharmony but I can't see the harm. And even if my change
> causes some problem, wouldn't it be better to know that while v1.5.0
> is still being actively developed?
>

My issue with this is that you have now hard-coded the fact that everyone else is going to use UTF-8. Yes, it most likely won't affect any compilation or tests (or even 90% of users), but I do not agree that this is the best way to fix the problem.

I still contest that setting the desired encoding (via the appropriate JVM property, as Bill Slacum initially suggested) is the proper way to address the issue. I don't feel like we ever actually came to a consensus on this discussion.
-
Re: Setting Charset in getBytes() call.David Medinets 2012-10-30, 23:47
> My issue with this is that you have now hard-coded the fact that everyone else is going to use UTF-8.
Who is everyone else? I agree that I have hard-coded the use of UTF-8. On the other hand, I've merely codified an existing practice. Thus the issue is now exposed, and the places the convention is used are defined. Once a consensus is reached, we can implement it with confidence.

> way to fix the problem. I still contest that setting the desired encoding
> (via the appropriate JVM property like Bill Slacum initially suggested) is the
> proper way to address the issue.

It is easy to do both. Create a ByteEncodingInitializer (or somesuch) class that reads the JVM property and defines a globally used Charset. Then find those utf8 definitions and usages and replace them with the globally-defined value.
-
Re: Setting Charset in getBytes() call.Josh Elser 2012-10-31, 00:21
On 10/30/2012 7:47 PM, David Medinets wrote:
>> My issue with this is that you have now hard-coded the fact that everyone else is going to use UTF-8. > > Who is everyone else? I agree that I have hard-coded the use of UTF-8. > On the other hand, I've merely codified an existing practice. Thus the > issue is now exposed, the places the convention is used are defined. > Once a consensus is reached, we can implement it with confidence. "Everyone else" is everyone who builds Accumulo since you committed your changes and uses it. Ignoring that, forcing a single charset isn't the big issue here (as we've *all* agreed that UTF-8 should not cause any data-correctness issues) so for now I'll just drop it as it's just creating confusion. My issue is *how* you implemented the default charset. We already have 3 people (Marc, Bill and myself) who have stated that we believe inline charset declaration is not the correct implementation and that using the JVM property is the better implementation. I'd encourage others to weigh in to form a complete consensus and shift the discussion to that implementation if needed. > >> way to fix the problem. I still contest that setting the desired encoding >> (via the appropriate JVM property like Bill Slacum initial suggested) is the >> proper way to address the issue. > > It is easy to do both. Create a ByteEncodingInitializer (or somesuch) > class that reads the JVM property and defines a globally used Charset. > The find those utf8 definitions and usages and replace them with the > globally-defined value. Again, by setting the 'file.encoding' JVM parameter, such a class is unnecessary because it should be handled internal to Java. For Oracle/Sun JDK and OpenJDK, setting the "file.encoding" parameter at run time will use the provided charset you wanted without actually changing any code. +
-
Re: Setting Charset in getBytes() call.Benson Margulies 2012-10-31, 00:31
On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> On 10/30/2012 7:47 PM, David Medinets wrote: >>> >>> My issue with this is that you have now hard-coded the fact that everyone >>> else is going to use UTF-8. >> >> >> Who is everyone else? I agree that I have hard-coded the use of UTF-8. >> On the other hand, I've merely codified an existing practice. Thus the >> issue is now exposed, the places the convention is used are defined. >> Once a consensus is reached, we can implement it with confidence. > > > "Everyone else" is everyone who builds Accumulo since you committed your > changes and uses it. Ignoring that, forcing a single charset isn't the big > issue here (as we've *all* agreed that UTF-8 should not cause any > data-correctness issues) so for now I'll just drop it as it's just creating > confusion. > > My issue is *how* you implemented the default charset. We already have 3 > people (Marc, Bill and myself) who have stated that we believe inline > charset declaration is not the correct implementation and that using the JVM > property is the better implementation. > > I'd encourage others to weigh in to form a complete consensus and shift the > discussion to that implementation if needed. > >> >>> way to fix the problem. I still contest that setting the desired encoding >>> (via the appropriate JVM property like Bill Slacum initial suggested) is >>> the >>> proper way to address the issue. >> >> >> It is easy to do both. Create a ByteEncodingInitializer (or somesuch) >> class that reads the JVM property and defines a globally used Charset. >> The find those utf8 definitions and usages and replace them with the >> globally-defined value. > > > Again, by setting the 'file.encoding' JVM parameter, such a class is > unnecessary because it should be handled internal to Java. For Oracle/Sun > JDK and OpenJDK, setting the "file.encoding" parameter at run time will use > the provided charset you wanted without actually changing any code. If Accumulo was only a pile of servers, you could do this. 
You could say that part of the configuration process for the servers is to specify the desired encoding to file.encoding, and your shell scripts could set UTF-8 by default.

But Accumulo is *not* just a pile of servers. Setting file.encoding affects the entire JVM. A webapp that uses Accumulo now would need to have the entire servlet container have a particular setting of file.encoding. This just does not work in the wild. Even without the servlet container issue, a user of Accumulo may need to plug it into an existing code base that has other reasons to set file.encoding, and will not like it when Accumulo starts to corrupt his or her string data.
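The difference between relying on the JVM-wide default and naming a charset at the call site can be seen in a short sketch. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later. The no-argument getBytes() follows Charset.defaultCharset(), which is shared by everything in the JVM, while the explicit form gives the same bytes however the JVM was started.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Encodes with the JVM default charset, whatever the host application configured.
    static byte[] implicit(String s) {
        return s.getBytes();
    }

    // Encodes with UTF-8 regardless of the JVM default.
    static byte[] explicit(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("implicit: " + Arrays.toString(implicit(s)));
        System.out.println("explicit: " + Arrays.toString(explicit(s))); // [-61, -87] on every JVM
    }
}
```

On a JVM whose default is ISO-8859-1 the implicit call yields a single byte for this string, so library code that uses it inherits whatever the embedding application chose.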
-
Re: Setting Charset in getBytes() call.William Slacum 2012-10-31, 00:41
Accumulo may not be just a set of servers, but it is designed to be a set
of processes, which means having their own JVM. I think this mostly boils down to an issue of API however-- if Accumulo deals with user's data in terms of bytes, then this issue is put back on the user, which I'm fine with as a trade off between configuration versus convention. There are other cases beyond simply a client API, though, namely configuration. I'm more comfortable with enforcing some standard there. On Tue, Oct 30, 2012 at 8:31 PM, Benson Margulies <[EMAIL PROTECTED]>wrote: > On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <[EMAIL PROTECTED]> wrote: > > On 10/30/2012 7:47 PM, David Medinets wrote: > >>> > >>> My issue with this is that you have now hard-coded the fact that > everyone > >>> else is going to use UTF-8. > >> > >> > >> Who is everyone else? I agree that I have hard-coded the use of UTF-8. > >> On the other hand, I've merely codified an existing practice. Thus the > >> issue is now exposed, the places the convention is used are defined. > >> Once a consensus is reached, we can implement it with confidence. > > > > > > "Everyone else" is everyone who builds Accumulo since you committed your > > changes and uses it. Ignoring that, forcing a single charset isn't the > big > > issue here (as we've *all* agreed that UTF-8 should not cause any > > data-correctness issues) so for now I'll just drop it as it's just > creating > > confusion. > > > > My issue is *how* you implemented the default charset. We already have 3 > > people (Marc, Bill and myself) who have stated that we believe inline > > charset declaration is not the correct implementation and that using the > JVM > > property is the better implementation. > > > > I'd encourage others to weigh in to form a complete consensus and shift > the > > discussion to that implementation if needed. > > > >> > >>> way to fix the problem. 
I still contest that setting the desired > encoding > >>> (via the appropriate JVM property like Bill Slacum initial suggested) > is > >>> the > >>> proper way to address the issue. > >> > >> > >> It is easy to do both. Create a ByteEncodingInitializer (or somesuch) > >> class that reads the JVM property and defines a globally used Charset. > >> The find those utf8 definitions and usages and replace them with the > >> globally-defined value. > > > > > > Again, by setting the 'file.encoding' JVM parameter, such a class is > > unnecessary because it should be handled internal to Java. For Oracle/Sun > > JDK and OpenJDK, setting the "file.encoding" parameter at run time will > use > > the provided charset you wanted without actually changing any code. > > If Accumulo was only a pile of servers, you could do this. You could > say that part of the configuration process for the servers is to > specify the desired encoding to file.encoding, and your shell scripts > could set UTF-8 by default. > > But Accumulo is *not* just a pile of servers. Setting file.encoding > effects the entire JVM. A webapp that uses Accumulo now would need to > have the entire servlet container have a particular setting of > file.encoding. This just does not work in the wild. Even without the > servlet container issue, a user of Accumulo may need to plug it into > an existing code base that has other reasons to set file.encoding, and > will not like it when Accumulo starts to corrupt his or her string > data. > +
-
Re: Setting Charset in getBytes() call.David Medinets 2012-10-31, 02:29
Re-reading and re-thinking I can see your point about how, by
specifying UTF-8, Accumulo is now flouting the file.encoding parameter. I'd like to implement a static method inside core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then do something like getBytes(Encoding.getCharset()) instead of hard-coding UTF-8.

    import java.nio.charset.Charset;

    public class Encoding {
      // Not final: the field is assigned lazily on first use.
      private static Charset charset = null;

      // Static (and synchronized) so callers can write Encoding.getCharset().
      public static synchronized Charset getCharset() {
        if (charset == null) {
          charset = Charset.forName(System.getProperty("file.encoding", "UTF-8"));
        }
        return charset;
      }
      ...
    }
-
Re: Setting Charset in getBytes() call.John Vines 2012-10-31, 02:35
Why not just have a configuration in the xml file for setting a global
charset? This way we avoid hard coded settings but also avoid the issue of shared vm issues. John Sent from my phone, pardon the typos and brevity. On Oct 30, 2012 10:29 PM, "David Medinets" <[EMAIL PROTECTED]> wrote: > Re-reading and re-thinking I can see your point about how, by > specifying UTF-8, Accumulo is now flouting the file.encoding > parameter. I'd like to implement a static method inside > core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then > do something like getBytes(Encoding.getCharset()) instead of > hard-coding UTF-8. > > Class Encoding { > private static final Charset charset = null; > public Charset getCharset() { > if (charset == null) { > charset = Charset.forName(System.getProperty("file.encoding", > "UTF-8")); > } > return charset; > } > ... > } > +
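A configuration-driven lookup along these lines might look like the sketch below. Everything in it is an assumption made for illustration: java.util.Properties stands in for the XML configuration, and the key "example.charset" is hypothetical rather than an actual Accumulo property. StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class ConfiguredCharset {
    // Hypothetical key for illustration; not an actual Accumulo property.
    static final String KEY = "example.charset";

    // Returns the configured charset, or UTF-8 when the key is absent.
    // An unknown or malformed name makes Charset.forName throw, which
    // surfaces a misconfiguration instead of silently guessing.
    static Charset fromConfig(Properties conf) {
        String name = conf.getProperty(KEY);
        return name == null ? StandardCharsets.UTF_8 : Charset.forName(name);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(fromConfig(conf)); // UTF-8
        conf.setProperty(KEY, "ISO-8859-1");
        System.out.println(fromConfig(conf)); // ISO-8859-1
    }
}
```

Unlike file.encoding, a setting read this way does not affect other code sharing the JVM, though clients and servers would still need to agree on the value.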
-
Re: Setting Charset in getBytes() call.Christopher Tubbs 2012-10-31, 18:02
I've added my own comments to this thread on the ACCUMULO-840 ticket.
https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024 -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Tue, Oct 30, 2012 at 10:35 PM, John Vines <[EMAIL PROTECTED]> wrote: > Why not just have a configuration in the xml file for setting a global > charset? This way we avoid hard coded settings but also avoid the issue of > shared vm issues. > > John > > Sent from my phone, pardon the typos and brevity. > On Oct 30, 2012 10:29 PM, "David Medinets" <[EMAIL PROTECTED]> wrote: > >> Re-reading and re-thinking I can see your point about how, by >> specifying UTF-8, Accumulo is now flouting the file.encoding >> parameter. I'd like to implement a static method inside >> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then >> do something like getBytes(Encoding.getCharset()) instead of >> hard-coding UTF-8. >> >> Class Encoding { >> private static final Charset charset = null; >> public Charset getCharset() { >> if (charset == null) { >> charset = Charset.forName(System.getProperty("file.encoding", >> "UTF-8")); >> } >> return charset; >> } >> ... >> } >> +
-
Re: Setting Charset in getBytes() call.Marc Parisi 2012-11-02, 12:24
John, that would lead us to a configuration management issue. To keep
configuration files in line would be the same as ensuring file.encoding is the same across the platform. The JLS doesn't specify a Charset encoding scheme; however, for quite some time the file.encoding fall through ( that is, when it's not specified ) is UTF-8. This could change, and is not backed by the JLS, yet, file.encoding is. It's a fallthrough, meant to take care of configuration mismanagement. Further, these changes will have issues if you specify a file.encoding in your configuration, as you don't always enforce UTF-8 in every String instance, especially in some of your aggregator changes. "If Accumulo was only a pile of servers, you could do this. You could say that part of the configuration process for the servers is to specify the desired encoding to file.encoding, and your shell scripts could set UTF-8 by default. But Accumulo is *not* just a pile of servers. Setting file.encoding effects the entire JVM. A webapp that uses Accumulo now would need to have the entire servlet container have a particular setting of file.encoding. This just does not work in the wild. Even without the servlet container issue, a user of Accumulo may need to plug it into an existing code base that has other reasons to set file.encoding, and will not like it when Accumulo starts to corrupt his or her string data." I gather that what you mean is that multiple, transient, execution paths within the tserver should support multiple encodings; however, setting file.encoding ensures that the platform, which is encompassed in a JVM, encodes and decodes values in an understood way ( that's what character set encodings are meant to enforce ). If a user wishes to have his or her own execution path ( or their own encoding for an iterator ), then he/she would likely define this. The fact that we require configuration parameters for the bulk of these changes in core is an indication that the core API contains features that are seeping into user functionality. 
Keep the encoding/decoding at client code, not within the tserver process. Use file.encoding for the core project, and our changeset would be much smaller, require that clients do their own encoding/decoding. A webapp is a fantastic example; however, let's take it a step further. Accumulo is JBoss. The iterator/client code is the webapp. We should separate Accumulo from client and client iterator code to avoid these design issues and place the onus on the user, not accumulo. In all honesty, and I'm probably off base, but in the case of iterators, we should move them to a different package, and if so desired, add options to the iterators, but there is no need to default to UTF-8. It's been that way for some time. On Wed, Oct 31, 2012 at 2:02 PM, Christopher Tubbs <[EMAIL PROTECTED]>wrote: > I've added my own comments to this thread on the ACCUMULO-840 ticket. > > https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024 > > -- > Christopher L Tubbs II > http://gravatar.com/ctubbsii > > > On Tue, Oct 30, 2012 at 10:35 PM, John Vines <[EMAIL PROTECTED]> wrote: > > Why not just have a configuration in the xml file for setting a global > > charset? This way we avoid hard coded settings but also avoid the issue > of > > shared vm issues. > > > > John > > > > Sent from my phone, pardon the typos and brevity. > > On Oct 30, 2012 10:29 PM, "David Medinets" <[EMAIL PROTECTED]> > wrote: > > > >> Re-reading and re-thinking I can see your point about how, by > >> specifying UTF-8, Accumulo is now flouting the file.encoding > >> parameter. I'd like to implement a static method inside > >> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then > >> do something like getBytes(Encoding.getCharset()) instead of > >> hard-coding UTF-8. > >> > >> Class Encoding { > >> private static final Charset charset = null; +
-
Re: Setting Charset in getBytes() call.Benson Margulies 2012-11-02, 19:56
Maybe I'm being particularly dense, but I still think that this is
being made too complex by failing to enumerate the specific goals. First case; data for which Accumulo is defined to persistently store *characters*, as opposed to bytes. I would hope that, in all such cases, we would agree that those characters should be stored in some Unicode format, never in some legacy encoding. Second case; data for which Accumulo is defined to store bytes, but, for convenience, an API allows the user to read and write characters. In this case, I can imagine two competing API designs. One would be to mirror Java, and in all such cases give the user the option of specifying the charset, defaulting to file.encoding. The other would be to insist on UTF-8. A third possibility - to just respect file.encoding - seems to me to be retreading the errors of Java 1.x. Third case; cases in which the user either supplies a text file for Accumulo to read, or asks Accumulo to write a text file. Having an API that can default to file.encoding here would be convenient for users, who want files in their platform's default encoding. Note that this is incompatible with the notion of *setting* file.encoding as an implementation technique for getting the string constructor and getBytes() to do UTF-8. Finally for today, I had a hard time following the response to my writing on servlets. I'll vastly simplify my presentation: when a user of Accumulo writes Java code that calls the Accumulo API, I find it unacceptable to require that user to set file.encoding to get correct behavior from Accumulo, except as described in the second case above. When Accumulo classes are integrated into user applications, Accumulo must respect file.encoding, or ignore file.encoding, but it cannot require the user to set it to something in particular to get correct behavior. +
-
Re: Setting Charset in getBytes() call.John Vines 2012-11-02, 20:18
Client/server mismatch is a giant problem. And the more configurability we put into Accumulo the closer we get to users hitting a knowledge barrier about knowing the specifics of their Accumulo instance. I believe there are two avenues for dealing with this:

1. Avoid at all costs. Unfortunately, this can ultimately boil down to users losing features because we don't want them to have any sort of intimate knowledge of the system.
2. A remote configuration utility. If we can have the client code pull the configuration from the server, perhaps when Connection is made, we can have our client APIs consistent on both sides of the channel. I believe a solution like this could handle the issue Benson mentions, but it also means we cannot approach this encoding issue with file.encoding.

Personally, I think the second option is an inevitability for us as we do more and more features which are configuration specific. Either way, it does seem that file.encoding is not sufficient as we want to avoid the client code requiring some extremely specific documentation. It might even be an incompatible configuration with what the client wants to do.

I think we are overgeneralizing this issue though. Josh did a decent job of starting to hammer away on this. It's not just a matter of us doing weird things with encodings, but the cases for them. For instance, all zookeeper operations need to be done the same way across the board. This needs to be shared knowledge for both servers and clients. So these should have charset specifications. But other things (pulling things out of thin air), such as the client api methods, are outside of the purview, primarily because they are not associated with any tables until well after they are created. So that is a user-space burden and should not be a concern for us. The same goes for any sort of local string operation.

It boils down to this: if it directly goes into HDFS, zookeeper, or the !METADATA table then we should enforce encoding, in the way Dave approached it.
Outside of those scopes I think we should really just leave them the hell alone, because the system shouldn't be messing with users' data.

John

On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:

> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.
>
> First case; data for which Accumulo is defined to persistently store
> *characters*, as opposed to bytes. I would hope that, in all such
> cases, we would agree that those characters should be stored in some
> Unicode format, never in some legacy encoding.
>
> Second case; data for which Accumulo is defined to store bytes, but,
> for convenience, an API allows the user to read and write characters.
> In this case, I can imagine two competing API designs. One would be to
> mirror Java, and in all such cases give the user the option of
> specifying the charset, defaulting to file.encoding. The other would
> be to insist on UTF-8. A third possibility - to just respect
> file.encoding - seems to me to be retreading the errors of Java 1.x.
>
> Third case; cases in which the user either supplies a text file for
> Accumulo to read, or asks Accumulo to write a text file. Having an API
> that can default to file.encoding here would be convenient for users,
> who want files in their platform's default encoding. Note that this is
> incompatible with the notion of *setting* file.encoding as an
> implementation technique for getting the string constructor and
> getBytes() to do UTF-8.
>
> Finally for today, I had a hard time following the response to my
> writing on servlets. I'll vastly simplify my presentation: when a user
> of Accumulo writes Java code that calls the Accumulo API, I find it
> unacceptable to require that user to set file.encoding to get correct
> behavior from Accumulo, except as described in the second case above.
> When Accumulo classes are integrated into user applications, Accumulo
> must respect file.encoding, or ignore file.encoding, but it cannot
John Vines 2012-11-02, 20:18
Re: Setting Charset in getBytes() call.Christopher Tubbs 2012-11-03, 01:54
On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:
> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.

I agree that there has been a failure to enumerate specific goals with regard to encoding. I made an attempt to identify potential goals (scopes) for which encoding matters on this ticket:

https://issues.apache.org/jira/browse/ACCUMULO-840

In there, I identify two considerations:

1) API issues addressing consistency for user data (e.g. passwords, table names, Mutation constructors that take Strings), and
2) INTERNAL issues related to Accumulo storing and reading state that persists or is communicated between its operating components (a clear example of this is how we store the !METADATA column family names, which start out as Java String literals and get encoded to bytes by the time they are stored in the table).

I think #1 can be addressed by simply waiting until somebody presents a feature request with a use case, and in the meantime, we simply don't touch it.

I think #2 can be addressed by establishing an internal policy (along the lines of our code style standards) stating that Accumulo will consistently store String data for its internal use as UTF8 when we have to store that String as bytes, and that when we convert such bytes into Strings, we do so under the assumption that they are UTF8. If we can agree to this policy, anything that is actually non-compliant (i.e. where there's a possibility it won't be stored or read as UTF8) will simply be a bug, to which we apply a very narrowly scoped bugfix to ensure consistency with the policy.

I think David has already identified some such cases and attempted to fix them in the process of working on ACCUMULO-836. I think those are fine, but they need to be checked to ensure that when they are converted back to a String, they are read as UTF8. However, it might be better if these changes were split into separate bugs, because even though they are all the same class of bug, they apply to separate components (e.g. "Potential bug - Inconsistent encoding with Zookeeper data", "Potential bug - Inconsistent encoding with mapreduce configuration", etc.). These bugs can be identified and fixed as we encounter them, rather than as an attempt to fix the entire code base. We shouldn't have to spend a lot of time on them... we should do the simple thing first: establish the policy.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
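To make the internal policy above concrete, here is a minimal sketch of what "always write UTF8, always read UTF8" could look like if the conversion lived in one place. The class name `InternalStrings` and its two helpers are hypothetical, not existing Accumulo code, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class InternalStrings {
    // Hypothetical single home for the policy: internal Strings are UTF-8 on disk and on the wire.
    static final Charset INTERNAL = StandardCharsets.UTF_8;

    // Encode an internal String for storage.
    static byte[] toBytes(String s) {
        return s.getBytes(INTERNAL);
    }

    // Decode stored bytes under the same assumption they were written with.
    static String fromBytes(byte[] b) {
        return new String(b, INTERNAL);
    }

    public static void main(String[] args) {
        String family = "caf\u00e9"; // contains one non-ASCII character
        byte[] stored = toBytes(family);

        // Same charset on both sides: the String comes back unchanged.
        System.out.println(fromBytes(stored).equals(family));

        // Different charset on the read side: no exception, just a different String.
        String misread = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(family));
    }
}
```

The second comparison is the class of bug the policy is meant to catch: a mismatch between the write side and the read side does not fail, it silently yields different text.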
Re: Setting Charset in getBytes() call.David Medinets 2012-11-03, 03:34
Christopher, thanks for taking the time to write up your comments. Nicely done.
Re: Setting Charset in getBytes() call.Josh Elser 2012-11-02, 23:34
I also worked through the changes and found some questionable changes.

https://issues.apache.org/jira/browse/ACCUMULO-836?focusedCommentId=13489228&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13489228

On 10/31/2012 01:02 PM, Christopher Tubbs wrote:
> I've added my own comments to this thread on the ACCUMULO-840 ticket.
> https://issues.apache.org/jira/browse/ACCUMULO-840?focusedCommentId=13488024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13488024
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Tue, Oct 30, 2012 at 10:35 PM, John Vines <[EMAIL PROTECTED]> wrote:
>> Why not just have a configuration in the xml file for setting a global
>> charset? This way we avoid hard coded settings but also avoid the issue of
>> shared vm issues.
>>
>> John
>>
>> Sent from my phone, pardon the typos and brevity.
>> On Oct 30, 2012 10:29 PM, "David Medinets" <[EMAIL PROTECTED]> wrote:
>>
>>> Re-reading and re-thinking I can see your point about how, by
>>> specifying UTF-8, Accumulo is now flouting the file.encoding
>>> parameter. I'd like to implement a static method inside
>>> core/src/main/java/org/apache/accumulo/core/util/Encoding.java. Then
>>> do something like getBytes(Encoding.getCharset()) instead of
>>> hard-coding UTF-8.
>>>
>>> Class Encoding {
>>>   private static final Charset charset = null;
>>>   public Charset getCharset() {
>>>     if (charset == null) {
>>>       charset = Charset.forName(System.getProperty("file.encoding",
>>> "UTF-8"));
>>>     }
>>>     return charset;
>>>   }
>>>   ...
>>> }
>>>
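The Encoding helper quoted above would not compile as written: the field is `final` but reassigned, and the accessor is an instance method although the proposed call site, `getBytes(Encoding.getCharset())`, is static. A minimal sketch of the same idea that does compile might look like the following. It keeps the quoted behavior (follow `file.encoding`, fall back to UTF-8) purely for illustration and is not a recommendation for either side of the debate; the `main` method is only a usage demo.

```java
import java.nio.charset.Charset;

public final class Encoding {
    // Resolved once during class initialization, so the field can stay final
    // and no lazy null check is needed.
    private static final Charset CHARSET =
            Charset.forName(System.getProperty("file.encoding", "UTF-8"));

    private Encoding() {
        // static helper only, no instances
    }

    // Static accessor, matching the getBytes(Encoding.getCharset()) call site.
    public static Charset getCharset() {
        return CHARSET;
    }

    public static void main(String[] args) {
        byte[] bytes = "password".getBytes(Encoding.getCharset());
        System.out.println(Encoding.getCharset().name() + " -> " + bytes.length + " bytes");
    }
}
```

Because the charset is captured once, a later change to the `file.encoding` system property inside the same JVM has no effect on it, which is one of the shared-VM concerns raised in this thread.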
Re: Setting Charset in getBytes() call.Drew Farris 2012-10-30, 01:22
I have always wondered if there were cases in the API where users are forced to use Text when they would otherwise prefer byte[], e.g. stuffing a non-UTF-8 byte[] into a Text object to facilitate storage or sorting. I'm not entirely sure whether Text would complain if this were the case. I suspect we should seek to eliminate these if they currently exist.

Speaking strictly of user data, I agree that fundamentally, every operation should be based upon byte[]. API methods providing Text and String based calls should be convenience methods where the conversion of text to/from bytes is handled explicitly (not relying on platform default encoding or properties) and transparently (doing something sensible when the user doesn't care or is unaware of the issues surrounding character encoding).

Regarding utf8, is there a need to support arbitrary character encodings when persisting bytes to Accumulo? Think byte order for lexical sorting, fixed vs variable length, etc. Perhaps it would not be unreasonable to support explicitly stating a character encoding on table creation?

Drew

On Oct 29, 2012 8:47 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote:

> +1 Mike.
>
> 1. It would be hard for me to believe Key/Value are ever handled
> internally in terms of Strings, but, if such a case does exist, it would be
> extremely prudent to fix.
>
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> by other commands [1,2]. It would be good to double check all of the other
> commands.
>
> [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>
>> 1. User data (keys and values)
>> 2. Ancillary things needed for operation of Accumulo (passwords).
>>
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>>
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.
>>
>>
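One reason the choice of charset matters for byte-oriented user data is that not every charset can carry arbitrary bytes through a String and back. The sketch below is plain JDK code, not Accumulo code; the class and method names are made up for illustration. It shows that ISO-8859-1 maps every byte value to a character and back, while UTF-8 replaces invalid byte sequences during decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteRoundTrip {
    // True if raw bytes survive bytes -> String -> bytes in the given charset.
    static boolean survives(byte[] raw, Charset cs) {
        return Arrays.equals(raw, new String(raw, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        // Arbitrary binary data, deliberately not valid UTF-8.
        byte[] raw = {(byte) 0xff, (byte) 0xfe, 0x00, 0x41};

        // Every byte value is a valid ISO-8859-1 character, so nothing is lost.
        System.out.println(survives(raw, StandardCharsets.ISO_8859_1)); // true

        // 0xff and 0xfe are never valid in UTF-8; decoding substitutes U+FFFD.
        System.out.println(survives(raw, StandardCharsets.UTF_8)); // false
    }
}
```

This is consistent with the point made above: keys and values are safest when they are never routed through a String at all.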
Re: Setting Charset in getBytes() call.Adam Fuchs 2012-10-30, 20:26
On Mon, Oct 29, 2012 at 9:22 PM, Drew Farris <[EMAIL PROTECTED]> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to eliminate these if they currently exist.
>

The Text class is essentially a wrapper around a byte[], with some convenience methods for translating to/from other types. Accumulo only ever reads bytes out of it, so there is no encoding problem there. We also don't use most of its convenience methods. Many people see that it is named "Text" and assume that it only stores human readable text, but that is not the case. It probably should have been named "ConvenientByteArrayWrapperWithSomeMemoryEfficiencySupportAndStringOrientedTranslationMethodsThatIsWritableComparable".

I also agree that it would be good to get rid of the reliance on Hadoop's Text object, especially because people often do not respect getLength() on the byte[] obtained from getBytes().

Adam
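The getLength() pitfall mentioned above can be shown without Hadoop on the classpath. The sketch below uses a small stand-in class, `ReusableBytes`, which is hypothetical and is NOT Hadoop's Text; it only imitates the relevant behavior, namely a backing array that is reused and may be longer than the valid data.

```java
import java.nio.charset.StandardCharsets;

public class BackingArrayPitfall {
    // Hypothetical stand-in for a reusable byte holder; NOT Hadoop's Text.
    static final class ReusableBytes {
        private byte[] buf = new byte[0];
        private int length;

        // Reuses the existing backing array when it is already large enough.
        void set(byte[] src) {
            if (src.length > buf.length) {
                buf = new byte[src.length];
            }
            System.arraycopy(src, 0, buf, 0, src.length);
            length = src.length;
        }

        byte[] getBytes() { return buf; }   // the whole backing array
        int getLength() { return length; }  // number of valid bytes
    }

    // Ignores getLength(): stale bytes from an earlier value leak in.
    static String wrong(ReusableBytes r) {
        return new String(r.getBytes(), StandardCharsets.UTF_8);
    }

    // Respects getLength(): only the valid prefix is decoded.
    static String right(ReusableBytes r) {
        return new String(r.getBytes(), 0, r.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableBytes r = new ReusableBytes();
        r.set("hello world".getBytes(StandardCharsets.UTF_8));
        r.set("hi".getBytes(StandardCharsets.UTF_8));
        System.out.println(wrong(r)); // hillo world
        System.out.println(right(r)); // hi
    }
}
```

With the real Text class the same rule applies: treat the array from getBytes() as valid only up to getLength().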
Re: Setting Charset in getBytes() call.Ed Kohlwey 2012-10-30, 01:44
I think there might be memory efficiency issues that should persuade us to adopt charsets rather than the current approach. I believe (rtfs to be sure) that charset.decode doesn't deep copy the underlying byte buffer, which is presumably good from a gc standpoint.

Either way, UTF8 is certainly the most widely used charset in existing deployments, and changing the default to something that is non-backwards compatible is probably a bad idea. I'm not familiar with the characteristics of the alternatives, but I strongly believe any across the board change needs to be compatible with existing deployments.

Perhaps a better approach than a JVM option or forcing one standard would be to create a configuration option.

On Oct 29, 2012 9:22 PM, "Drew Farris" <[EMAIL PROTECTED]> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to eliminate these if they currently exist.
>
> Speaking strictly of user data, I agree that fundamentally, every operation
> should be based upon byte[]. API methods providing Text and String based
> calls should be convenience methods where the conversion of text to/from
> bytes is handled explicitly (not relying on platform default encoding or
> properties) and transparently (doing something sensible when the user
> doesn't care or is unaware of the issues surrounding character encoding).
>
> Regarding utf8, is there a need to support arbitrary character encodings
> when persisting bytes to accumulo? Think byte order for lexical sorting,
> fixed vs variable length, etc. Perhaps it would not be unreasonable to
> support explicitly stating a character encoding on table creation?
>
> Drew
> On Oct 29, 2012 8:47 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote:
>
> > +1 Mike.
> >
> > 1. It would be hard for me to believe Key/Value are ever handled
> > internally in terms of Strings, but, if such a case does exist, it would be
> > extremely prudent to fix.
> >
> > 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> > by other commands [1,2]. It would be good to double check all of the other
> > commands.
> >
> > [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> > [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> >
> > On 10/29/2012 8:27 PM, Michael Flester wrote:
> >
> >> I agree with Benson entirely with one caveat. It seems to me that there
> >> might be two categories of things being discussed
> >>
> >> 1. User data (keys and values)
> >> 2. Ancillary things needed for operation of Accumulo (passwords).
> >>
> >> These could well be considered separately. Trying to do anything with
> >> keys and values other than treating them as bytes all of the time
> >> I find quite scary.
> >>
> >> And if this is only being done to satisfy pmd or findbugs, those tools
> >> can be convinced to modify their reporting about this issue.
> >>
> >>
>
Re: Setting Charset in getBytes() call.Ed Kohlwey 2012-10-30, 01:54
Also, on the topic of byte arrays - we should do one better than HBase and go for ByteBuffers. They are more reusable, and long-lived buffers can be allocated outside the heap and take advantage of OS I/O optimizations.

The current reliance on Text is in my opinion the greatest deficit of the client API. I have been fiddling with creating a new API, similar to the work Keith did with typo, but instead looking at introducing generic superclasses to reduce the API profile.

On Oct 29, 2012 9:22 PM, "Drew Farris" <[EMAIL PROTECTED]> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g: stuffing a
> non utf8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to eliminate these if they currently exist.
>
> Speaking strictly of user data, I agree that fundamentally, every operation
> should be based upon byte[]. API methods providing Text and String based
> calls should be convenience methods where the conversion of text to/from
> bytes is handled explicitly (not relying on platform default encoding or
> properties) and transparently (doing something sensible when the user
> doesn't care or is unaware of the issues surrounding character encoding).
>
> Regarding utf8, is there a need to support arbitrary character encodings
> when persisting bytes to accumulo? Think byte order for lexical sorting,
> fixed vs variable length, etc. Perhaps it would not be unreasonable to
> support explicitly stating a character encoding on table creation?
>
> Drew
> On Oct 29, 2012 8:47 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote:
>
> > +1 Mike.
> >
> > 1. It would be hard for me to believe Key/Value are ever handled
> > internally in terms of Strings, but, if such a case does exist, it would be
> > extremely prudent to fix.
> >
> > 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> > by other commands [1,2]. It would be good to double check all of the other
> > commands.
> >
> > [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> > [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> >
> > On 10/29/2012 8:27 PM, Michael Flester wrote:
> >
> >> I agree with Benson entirely with one caveat. It seems to me that there
> >> might be two categories of things being discussed
> >>
> >> 1. User data (keys and values)
> >> 2. Ancillary things needed for operation of Accumulo (passwords).
> >>
> >> These could well be considered separately. Trying to do anything with
> >> keys and values other than treating them as bytes all of the time
> >> I find quite scary.
> >>
> >> And if this is only being done to satisfy pmd or findbugs, those tools
> >> can be convinced to modify their reporting about this issue.
> >>
> >>
>
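For readers less familiar with the ByteBuffer features referred to above, here is a small sketch using only `java.nio`. It is not a proposed Accumulo API, and the class and helper names are made up. It shows a bounded view over an existing array (no copy, and the length travels with the buffer), an off-heap direct buffer, and decoding straight from a buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferSketch {
    // A read-only view of part of an array; no bytes are copied.
    static ByteBuffer view(byte[] backing, int offset, int length) {
        return ByteBuffer.wrap(backing, offset, length).slice().asReadOnlyBuffer();
    }

    // Copies the remaining bytes of src into a direct (off-heap) buffer.
    static ByteBuffer copyOffHeap(ByteBuffer src) {
        ByteBuffer direct = ByteBuffer.allocateDirect(src.remaining());
        direct.put(src.duplicate()); // duplicate() leaves the caller's position untouched
        direct.flip();               // switch the new buffer from writing to reading
        return direct;
    }

    // Decodes the remaining bytes as UTF-8 without disturbing the buffer's position.
    static String utf8(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "rowid:colfam".getBytes(StandardCharsets.UTF_8);
        ByteBuffer row = view(backing, 0, 5);
        ByteBuffer direct = copyOffHeap(row);

        System.out.println(utf8(row));         // rowid
        System.out.println(row.remaining());   // 5
        System.out.println(direct.isDirect()); // true
        System.out.println(utf8(direct));      // rowid
    }
}
```

Unlike a bare byte[] paired with a separate length, the view cannot be read past its limit, which avoids the getLength() mistake discussed elsewhere in this thread.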
Re: Setting Charset in getBytes() call.Eric Newton 2012-10-30, 20:02
> The current reliance on Text is in my opinion the greatest deficit of the
> client API-

Heh... if only that were true!

But it is a big wart: it requires users to include the hadoop-core library in client code, which is unfortunate. But it's probably going to be required for the importDirectory() call anyhow.

-Eric
Re: Setting Charset in getBytes() call.Marc Parisi 2012-10-30, 22:28
Instead of taking the platform encoding (which can be changed, as von cloud suggested), we're creating a class-specific private definition? This isn't particularly dynamic, and it leaves no override for those who want to use a platform-specific encoding. This is a maintenance nightmare. Before you move forward with it, think of someone who doesn't want this change and how difficult it would be for them to use their own or a different encoding scheme.

At the very least you can create a singleton that creates a dynamic encoding for that, which can be modified via the Accumulo configuration.

On Tue, Oct 30, 2012 at 4:02 PM, Eric Newton <[EMAIL PROTECTED]> wrote:

> > The current reliance on Text is in my opinion the greatest deficit of the
> > client API-
>
>
>
> Heh... if only that were true!
>
> But it is a big wart: requires users' to include the hadoop-core library in
> client code, which is unfortunate. But it's probably going to be required
> for the importDirectory() call anyhow.
>
> -Eric
>
Re: Setting Charset in getBytes() call.Marc Parisi 2012-10-30, 22:31
sorry, I meant *for that instance

On Tue, Oct 30, 2012 at 6:28 PM, Marc Parisi <[EMAIL PROTECTED]> wrote:

> Instead of taking the platform encoding ( which can be changed as von
> cloud suggested ), we're creating a class specific private definition? This
> isn't particularly dynamic and an override for those who want to use a
> platform specific encoding. This is a maintenance nightmare. I think before
> you move forward with it, think of someone who doesn't want this change and
> how difficult it would be to utilize their own or a different encoding
> scheme.
>
> At the very least you can create a singleton that creates a dynamic
> encoding for that, which can be modified via the accumulo configuration.
>
>
> On Tue, Oct 30, 2012 at 4:02 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>
>> > The current reliance on Text is in my opinion the greatest deficit of the
>> > client API-
>>
>>
>>
>> Heh... if only that were true!
>>
>> But it is a big wart: requires users' to include the hadoop-core library in
>> client code, which is unfortunate. But it's probably going to be required
>> for the importDirectory() call anyhow.
>>
>> -Eric
>>
>
>
Re: Setting Charset in getBytes() call.Benson Margulies 2012-10-30, 23:26
UTF-8 allows any String to go into the database and come back out. Are there actually use cases in which some application code pushes in strings, and another pulls out bytes, and would be perturbed to find UTF-8 as opposed to some other encoding in the bytes?
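The round-trip property claimed above is easy to check with plain JDK code. The sketch below (class and method names are illustrative only) compares UTF-8 with a legacy single-byte charset for a String containing Greek and Japanese characters. One caveat worth stating: the guarantee holds for well-formed Strings; a String containing an unpaired surrogate is not valid Unicode text and will not survive encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringRoundTrip {
    // True if the String survives String -> bytes -> String in the given charset.
    static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        // Latin, Greek, and Japanese characters in one value.
        String s = "Accumulo \u03b1\u03b2\u03b3 \u65e5\u672c";

        // UTF-8 can encode every Unicode code point.
        System.out.println(roundTrips(s, StandardCharsets.UTF_8)); // true

        // ISO-8859-1 has no Greek or Japanese characters; getBytes substitutes '?'.
        System.out.println(roundTrips(s, StandardCharsets.ISO_8859_1)); // false
    }
}
```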