Accumulo >> mail # dev >> Setting Charset in getBytes() call.

Re: Setting Charset in getBytes() call.
I have always wondered whether there are cases in the API where users are
forced to use Text when they would otherwise prefer byte[], e.g. stuffing a
non-UTF-8 byte[] into a Text object to facilitate storage or sorting. I am
not entirely sure whether Text would complain in that case. I suspect
we should seek to eliminate these cases if they currently exist.
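
To make the concern concrete, here is a minimal sketch using only the JDK. It does not touch Text at all, so it says nothing about whether Text itself complains; it only shows that arbitrary binary data is not necessarily valid UTF-8 and does not survive a trip through String. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NonUtf8RoundTripExample {
    // True if the bytes decode cleanly as UTF-8.
    // A freshly created CharsetDecoder reports malformed input instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if bytes -> String -> bytes gives back the original bytes.
    // new String(byte[], Charset) silently substitutes a replacement character
    // for malformed input, so invalid sequences are lost here.
    static boolean survivesStringRoundTrip(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xff, (byte) 0xfe, 0x00, 0x01}; // not valid UTF-8
        byte[] text = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println("binary valid UTF-8?    " + isValidUtf8(binary));
        System.out.println("binary survives trip?  " + survivesStringRoundTrip(binary));
        System.out.println("text valid UTF-8?      " + isValidUtf8(text));
        System.out.println("text survives trip?    " + survivesStringRoundTrip(text));
    }
}
```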

Speaking strictly of user data, I agree that fundamentally, every operation
should be based upon byte[]. API methods providing Text and String based
calls should be convenience methods where the conversion of text to/from
bytes is handled explicitly (not relying on platform default encoding or
properties) and transparently (doing something sensible when the user
doesn't care or is unaware of the issues surrounding character encoding).
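
As a rough illustration of "explicit and transparent", a String-based convenience method could look like the sketch below. The helper names toBytes/fromBytes are hypothetical and are not part of the Accumulo API; the choice of UTF-8 as the sensible default is my assumption. The point is only that the charset is named in the getBytes() call rather than inherited from the platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetExample {
    // Hypothetical convenience method: the encoding is always UTF-8,
    // regardless of file.encoding or the platform default.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical inverse of toBytes, again with the charset stated explicitly.
    static String fromBytes(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String row = "caf\u00e9"; // "café"

        byte[] utf8 = toBytes(row);                                // 5 bytes: é takes two
        byte[] latin1 = row.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes: é takes one
        byte[] platform = row.getBytes();                          // varies with the JVM's default charset

        System.out.println("UTF-8 length:      " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("default (" + Charset.defaultCharset() + ") length: " + platform.length);
        System.out.println("round trip ok:     " + row.equals(fromBytes(utf8)));
    }
}
```

The same String yields different bytes under different charsets, so a bare getBytes() can write different keys on different machines.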

Regarding UTF-8, is there a need to support arbitrary character encodings
when persisting bytes to Accumulo? Think byte order for lexical sorting,
fixed vs variable length, etc. Perhaps it would not be unreasonable to
support explicitly stating a character encoding on table creation?
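
On the sorting point, a small sketch of why the encoding matters, assuming keys are ordered by comparing their bytes as unsigned values (my assumption for this example; the comparison helper below is written by hand and is not Accumulo code). The same two strings sort in opposite orders depending on whether they are encoded as UTF-8 or UTF-16BE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSortOrderExample {
    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Encode both strings with the given charset, then compare the raw bytes.
    static int compareEncoded(String s, String t, Charset cs) {
        return compareUnsigned(s.getBytes(cs), t.getBytes(cs));
    }

    public static void main(String[] args) {
        String fullwidthTilde = "\uFF5E";  // U+FF5E, a single UTF-16 code unit
        String emoji = "\uD83D\uDE00";     // U+1F600, a surrogate pair in UTF-16

        // UTF-8 bytes: EF BD 9E vs F0 9F 98 80, so the tilde sorts first.
        System.out.println("UTF-8:    tilde before emoji? "
                + (compareEncoded(fullwidthTilde, emoji, StandardCharsets.UTF_8) < 0));

        // UTF-16BE bytes: FF 5E vs D8 3D DE 00, so the emoji sorts first.
        System.out.println("UTF-16BE: tilde before emoji? "
                + (compareEncoded(fullwidthTilde, emoji, StandardCharsets.UTF_16BE) < 0));
    }
}
```

UTF-8 byte order follows code point order, while UTF-16 does not once surrogate pairs are involved, which is one argument for fixing a single encoding rather than supporting arbitrary ones.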

 On Oct 29, 2012 8:47 PM, "Josh Elser" <[EMAIL PROTECTED]> wrote:

> +1 Mike.
> 1. It would be hard for me to believe Key/Value are ever handled
> internally in terms of Strings, but, if such a case does exist, it would be
> extremely prudent to fix.
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> by other commands [1,2]. It would be good to double check all of the other
> commands.
> [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed:
>>    1. User data (keys and values)
>>    2. Ancillary things needed for operation of Accumulo (passwords).
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>> And if this is only being done to satisfy pmd or findbugs, those tools
>> can be convinced to modify their reporting about this issue.