Accumulo, mail # dev - Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
Benson Margulies 2012-10-30, 00:54
On Mon, Oct 29, 2012 at 8:46 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> +1 Mike.
>
> 1. It would be hard for me to believe Key/Value are ever handled internally
> in terms of Strings, but, if such a case does exist, it would be extremely
> prudent to fix.
>
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced by
> other commands [1,2]. It would be good to double check all of the other
> commands.

I'm a bit lost. Any possible Java String can be rendered in UTF-8. So,
if you are calling String.getBytes to turn a string into some bytes
for some purpose, I think you need UTF-8.
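
To illustrate the encoding direction, here is a minimal sketch, assuming Java 7's StandardCharsets is available. The class name EncodeDemo and the helper roundTrip are hypothetical, not Accumulo code. It encodes a string with an explicit charset and decodes it again: well-formed text should survive UTF-8, while ISO-8859-1 substitutes '?' for any character outside Latin-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Hypothetical helper: encode with an explicit charset, then decode with the same one.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // U+00E9 (e-acute) exists in Latin-1; U+20AC (euro sign) does not.
        String s = "caf\u00e9 \u20ac";
        // UTF-8 can represent both characters, so the text comes back unchanged.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));
        // ISO-8859-1 cannot encode the euro sign, so getBytes substitutes '?'.
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1).equals(s));
    }
}
```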

On the other hand, as Mike pointed out, new String(somebytes, "utf-8")
will destroy data for some byte values that are not, in fact, UTF-8.
But why would Accumulo ever need to string-ify some array of bytes of
uncertain parentage?
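
The decoding direction can be sketched the same way, again assuming Java 7's StandardCharsets. DecodeDemo and survives are hypothetical names, and the byte values are chosen only because they are not valid UTF-8. Decoding replaces each malformed sequence with U+FFFD, so the original bytes cannot be recovered, whereas ISO-8859-1 maps every byte to exactly one character and back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeDemo {
    // Hypothetical helper: true if bytes -> String -> bytes returns the original array.
    static boolean survives(byte[] raw, Charset cs) {
        return Arrays.equals(raw, new String(raw, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        // 0xC3 must be followed by a continuation byte and 0xFF never occurs in UTF-8.
        byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF};
        // Malformed input is replaced during decoding, so the bytes are lost.
        System.out.println(survives(raw, StandardCharsets.UTF_8));
        // Every byte value is a valid ISO-8859-1 character, so nothing is lost.
        System.out.println(survives(raw, StandardCharsets.ISO_8859_1));
    }
}
```

This is the data loss Mike described, and it is one argument for keeping keys and values as byte arrays rather than passing them through String at all.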
>
> [1]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> [2]
> https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
>
>
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>>
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>
>>    1. User data (keys and values)
>>    2. Ancillary things needed for operation of Accumulo (passwords).
>>
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>>
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.
>>
>