Accumulo, mail # dev - Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
Benson Margulies 2012-11-02, 19:56
Maybe I'm being particularly dense, but I still think that this is
being made too complex by failing to enumerate the specific goals.

First case: data for which Accumulo is defined to persistently store
*characters*, as opposed to bytes. I would hope that, in all such
cases, we would agree that those characters should be stored in some
Unicode format, never in some legacy encoding.

Second case: data for which Accumulo is defined to store bytes, but,
for convenience, an API allows the user to read and write characters.
In this case, I can imagine two competing API designs. One would be to
mirror Java, and in all such cases give the user the option of
specifying the charset, defaulting to file.encoding. The other would
be to insist on UTF-8. A third possibility - to just respect
file.encoding - seems to me to be retreading the errors of Java 1.x.
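To make the two competing designs concrete, here is a minimal sketch. The
class names MirrorJavaApi and Utf8OnlyApi are hypothetical and are not part
of the Accumulo API; only the String and Charset calls are standard Java.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical design 1: mirror Java. The caller may pass a charset;
// the one-argument overload falls back to the platform default
// (normally derived from file.encoding).
final class MirrorJavaApi {
    private MirrorJavaApi() {}

    static byte[] toBytes(String text) {
        return toBytes(text, Charset.defaultCharset());
    }

    static byte[] toBytes(String text, Charset charset) {
        return text.getBytes(charset);
    }
}

// Hypothetical design 2: insist on UTF-8. The result never depends on
// file.encoding, so there is no charset parameter at all.
final class Utf8OnlyApi {
    private Utf8OnlyApi() {}

    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With design 2, the same string always produces the same bytes on every
machine. With design 1, only the two-argument overload has that property.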

Third case: the user either supplies a text file for
Accumulo to read, or asks Accumulo to write a text file. Having an API
that can default to file.encoding here would be convenient for users,
who want files in their platform's default encoding. Note that this is
incompatible with the notion of *setting* file.encoding as an
implementation technique for getting the string constructor and
getBytes() to do UTF-8.
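A rough sketch of what such a text-export API could look like follows. The
class TextExport and its method names are hypothetical, chosen only for
illustration; it writes to an OutputStream so the same code would serve a
FileOutputStream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not an Accumulo class.
final class TextExport {
    private TextExport() {}

    // Convenience overload: use the platform default charset, which is
    // what a user who wants "a normal text file for this machine" expects.
    static void writeText(OutputStream out, String text) {
        writeText(out, text, Charset.defaultCharset());
    }

    // Explicit overload: the caller decides the encoding of the file.
    static void writeText(OutputStream out, String text, Charset charset) {
        try {
            Writer writer = new OutputStreamWriter(out, charset);
            writer.write(text);
            writer.flush(); // push buffered characters through to the stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The convenience overload is only useful if file.encoding still reflects the
platform. If the JVM were launched with file.encoding forced to UTF-8 as an
implementation trick, that overload would no longer produce files in the
platform's encoding.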

Finally for today, I had a hard time following the response to my
writing on servlets. I'll vastly simplify my presentation: when a user
of Accumulo writes Java code that calls the Accumulo API, I find it
unacceptable to require that user to set file.encoding to get correct
behavior from Accumulo, except as described in the second case above.
When Accumulo classes are integrated into user applications, Accumulo
must respect file.encoding, or ignore file.encoding, but it cannot
require the user to set it to something in particular to get correct
behavior.
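
As an illustration of why depending on file.encoding is a correctness
problem, the sketch below simulates a writer and a reader whose default
charsets differ. The class DefaultCharsetHazard is hypothetical; the two
charsets are passed explicitly to stand in for two JVMs started with
different file.encoding values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not Accumulo code.
final class DefaultCharsetHazard {
    private DefaultCharsetHazard() {}

    // Encode with the "writer's" default, decode with the "reader's" default.
    static String roundTrip(String text, Charset writerDefault, Charset readerDefault) {
        byte[] stored = text.getBytes(writerDefault);
        return new String(stored, readerDefault);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(same.equals(original));  // true
        System.out.println(mixed.equals(original)); // false: decoded as two characters
    }
}
```

When both sides agree, the text survives. When they differ, the two UTF-8
bytes are decoded as two separate Latin-1 characters, which is the kind of
silent corruption a library invites if it relies on the caller's
file.encoding.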