Accumulo, mail # dev - Setting Charset in getBytes() call.


David Medinets 2012-10-28, 21:50
Ed Kohlwey 2012-10-28, 22:18
William Slacum 2012-10-29, 15:39
David Medinets 2012-10-29, 16:00
Josh Elser 2012-10-29, 16:21
Benson Margulies 2012-10-29, 16:24
John Vines 2012-10-29, 16:42
Josh Elser 2012-10-29, 16:57
David Medinets 2012-10-29, 17:00
William Slacum 2012-10-29, 17:13
Mike Drob 2012-10-29, 17:16
Michael Flester 2012-10-29, 19:14
John Vines 2012-10-29, 19:18
Benson Margulies 2012-10-29, 20:02
David Medinets 2012-10-29, 20:29
Michael Flester 2012-10-30, 00:27
Josh Elser 2012-10-30, 00:46
Benson Margulies 2012-10-30, 00:54
Josh Elser 2012-10-30, 01:57
John Vines 2012-10-30, 02:08
David Medinets 2012-10-30, 02:47
Josh Elser 2012-10-30, 22:27
David Medinets 2012-10-30, 23:47
Josh Elser 2012-10-31, 00:21
Benson Margulies 2012-10-31, 00:31
William Slacum 2012-10-31, 00:41
David Medinets 2012-10-31, 02:29
John Vines 2012-10-31, 02:35
Christopher Tubbs 2012-10-31, 18:02
Marc Parisi 2012-11-02, 12:24
Benson Margulies 2012-11-02, 19:56
Re: Setting Charset in getBytes() call.
John Vines 2012-11-02, 20:18
Client/server mismatch is a giant problem. And the more configurability we
put into Accumulo, the closer we get to users hitting a knowledge barrier
about the specifics of their Accumulo instance. I believe there are
two avenues for dealing with this:
1. Avoid at all costs. Unfortunately, this can ultimately boil down to
users losing features because we don't want them to have any sort of
intimate knowledge of the system.
2. A remote configuration utility. If we can have the client code pull the
configuration from the server, perhaps when the connection is made, we can
keep our client APIs consistent on both sides of the channel. I believe a
solution like this could handle the issue Benson mentions, but it also
means we cannot approach this encoding issue with file.encoding.
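
To make the second option concrete, here is a rough sketch, assuming the
server hands the client its settings as plain string key/value pairs at
connection time. The class, the property name "instance.charset", and the
UTF-8 fallback are all invented for illustration; none of them are
existing Accumulo API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the client takes its charset from configuration
// pulled from the server instead of from the local file.encoding.
final class RemoteConfigSketch {
    // Hypothetical property name, not a real Accumulo setting.
    static final String CHARSET_KEY = "instance.charset";

    // Resolve the charset advertised by the server; fall back to UTF-8
    // (an assumption) when the server does not advertise one.
    static Charset charsetFrom(Map<String, String> serverConfig) {
        String name = serverConfig.get(CHARSET_KEY);
        return name == null ? StandardCharsets.UTF_8 : Charset.forName(name);
    }

    public static void main(String[] args) {
        // Stand-in for whatever the server would send on connect.
        Map<String, String> serverConfig = new HashMap<>();
        serverConfig.put(CHARSET_KEY, "ISO-8859-1");

        Charset cs = charsetFrom(serverConfig);
        byte[] encoded = "caf\u00e9".getBytes(cs);
        System.out.println(cs + " -> " + encoded.length + " bytes");
    }
}
```

With something like this, the local file.encoding never enters the
picture, which is the point made above about not solving this through
file.encoding.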

Personally, I think the second option is an inevitability for us as we add
more and more features which are configuration specific. Either way, it
does seem that file.encoding is not sufficient, as we want to avoid the
client code requiring some extremely specific documentation. It might even
be an incompatible configuration with what the client wants to do.
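
As a small illustration of why relying on file.encoding is fragile, the
sketch below encodes the same string under two charsets and gets
different bytes. The class and method names are made up for the example;
only the JDK calls are real.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: the same String yields different bytes depending on
// the charset, so a client and server with different defaults disagree.
final class DefaultCharsetDemo {
    // Encode with an explicitly named charset, independent of file.encoding.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String row = "caf\u00e9"; // one non-ASCII character

        byte[] utf8 = encode(row, StandardCharsets.UTF_8);        // 5 bytes
        byte[] latin1 = encode(row, StandardCharsets.ISO_8859_1); // 4 bytes

        // The no-argument getBytes() follows whatever this JVM was started with.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 length: " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("same bytes: " + Arrays.equals(utf8, latin1));
    }
}
```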

I think we are overgeneralizing this issue though. Josh did a decent job
of starting to hammer away on this. It's not just a matter of us doing
weird things with encodings, but the cases for them. For instance, all
ZooKeeper operations need to be done the same way across the board. This
needs to be shared knowledge for both servers and clients, so these should
have charset specifications. But other things (pulling things out of thin
air), such as the client API methods, are outside of our purview, primarily
because they are not associated with any tables until well after they are
created. So that is a user-space burden and should not be a concern of
ours. The same goes for any sort of local string operation.

It boils down to this: if it goes directly into HDFS, ZooKeeper, or the
!METADATA table, then we should enforce encoding, in the way Dave
approached it. Outside of those scopes I think we should really just leave
them the hell alone, because the system shouldn't be messing with users'
data.
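
A minimal sketch of that split, assuming UTF-8 is the charset we agree on
for system-owned strings. The helper class and the example path are
hypothetical, not existing Accumulo code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strings the system itself writes to ZooKeeper,
// HDFS, or the !METADATA table always go through one fixed charset.
// User-supplied byte[] values never pass through here.
final class SystemEncoding {
    // Assumption: UTF-8 is the agreed charset for system-owned strings.
    static final Charset CHARSET = StandardCharsets.UTF_8;

    static byte[] toBytes(String s) {
        return s.getBytes(CHARSET); // never the no-argument getBytes()
    }

    static String fromBytes(byte[] b) {
        return new String(b, CHARSET); // never new String(byte[]) alone
    }

    public static void main(String[] args) {
        // Made-up path, standing in for any system-owned string.
        String path = "/example/tables/t\u00e9st";
        byte[] stored = toBytes(path);

        // Round-trips the same way on any JVM, whatever file.encoding is.
        System.out.println(fromBytes(stored).equals(path));
    }
}
```

User data stays as raw bytes and is passed through unchanged, so the
user's own choice of encoding is left alone.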

John
On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:

> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.
>
> First case; data for which Accumulo is defined to persistently store
> *characters*, as opposed to bytes. I would hope that, in all such
> cases, we would agree that those characters should be stored in some
> Unicode format, never in some legacy encoding.
>
> Second case; data for which Accumulo is defined to store bytes, but,
> for convenience, an API allows the user to read and write characters.
> In this case, I can imagine two competing API designs. One would be to
> mirror Java, and in all such cases give the user the option of
> specifying the charset, defaulting to file.encoding. The other would
> be to insist on UTF-8. A third possibility - to just respect
> file.encoding - seems to me to be retreading the errors of Java 1.x.
>
> Third case; cases in which the user either supplies a text file for
> Accumulo to read, or asks Accumulo to write a text file. Having an API
> that can default to file.encoding here would be convenient for users,
> who want files in their platform's default encoding. Note that this is
> incompatible with the notion of *setting* file.encoding as an
> implementation technique for getting the string constructor and
> getBytes() to do UTF-8.
>
> Finally for today, I had a hard time following the response to my
> writing on servlets. I'll vastly simplify my presentation: when a user
> of Accumulo writes Java code that calls the Accumulo API, I find it
> unacceptable to require that user to set file.encoding to get correct
> behavior from Accumulo, except as described in the second case above.
> When Accumulo classes are integrated into user applications, Accumulo
> must respect file.encoding, or ignore file.encoding, but it cannot
Christopher Tubbs 2012-11-03, 01:54
David Medinets 2012-11-03, 03:34
Josh Elser 2012-11-02, 23:34
Drew Farris 2012-10-30, 01:22
Adam Fuchs 2012-10-30, 20:26
Ed Kohlwey 2012-10-30, 01:44
Ed Kohlwey 2012-10-30, 01:54
Eric Newton 2012-10-30, 20:02
Marc Parisi 2012-10-30, 22:28
Marc Parisi 2012-10-30, 22:31
Benson Margulies 2012-10-30, 23:26