Accumulo >> mail # dev >> Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
Accumulo may not be just a set of servers, but it is designed to be a set
of processes, each with its own JVM. I think this mostly boils
down to an API issue, however: if Accumulo deals with user's data in
terms of bytes, then this issue is put back on the user, which I'm fine
with as a trade-off between configuration and convention.
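
To make the "put back on the user" point concrete, here is a minimal sketch of a caller choosing the charset before handing bytes to a byte-oriented API. The class and method names (ExplicitCharsetDemo, encodeUtf8, decodeUtf8) are made up for illustration and are not Accumulo code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Hypothetical helper: the caller picks the charset explicitly, so the
    // resulting bytes do not depend on how the JVM was started.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding has to use the same charset that was used for encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // last character is non-ASCII
        byte[] explicit = encodeUtf8(text);
        // The no-argument getBytes() uses the JVM-wide default charset,
        // so its result can differ from one deployment to another.
        byte[] implicit = text.getBytes();
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit UTF-8 length: " + explicit.length);
        System.out.println("default charset length: " + implicit.length);
        System.out.println("round trip ok: " + text.equals(decodeUtf8(explicit)));
    }
}
```

With this shape, the byte-level API never has to know which encoding the caller used; only the explicit calls on the caller's side decide it.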

There are other cases beyond simply a client API, though, namely
configuration. I'm more comfortable with enforcing some standard there.

On Tue, Oct 30, 2012 at 8:31 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> > On 10/30/2012 7:47 PM, David Medinets wrote:
> >>>
> >>> My issue with this is that you have now hard-coded the fact that
> >>> everyone else is going to use UTF-8.
> >>
> >>
> >> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
> >> On the other hand, I've merely codified an existing practice. Thus the
> >> issue is now exposed, and the places where the convention is used are defined.
> >> Once a consensus is reached, we can implement it with confidence.
> >
> >
> > "Everyone else" is everyone who builds Accumulo since you committed your
> > changes and uses it. Ignoring that, forcing a single charset isn't the big
> > issue here (as we've *all* agreed that UTF-8 should not cause any
> > data-correctness issues) so for now I'll just drop it as it's just creating
> > confusion.
> >
> > My issue is *how* you implemented the default charset. We already have 3
> > people (Marc, Bill and myself) who have stated that we believe inline
> > charset declaration is not the correct implementation and that using the JVM
> > property is the better implementation.
> >
> > I'd encourage others to weigh in to form a complete consensus and shift the
> > discussion to that implementation if needed.
> >
> >>
> >>> way to fix the problem. I still contest that setting the desired encoding
> >>> (via the appropriate JVM property like Bill Slacum initially suggested) is
> >>> the proper way to address the issue.
> >>
> >>
> >> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
> >> class that reads the JVM property and defines a globally used Charset.
> >> Then find those utf8 definitions and usages and replace them with the
> >> globally-defined value.
> >
> >
> > Again, by setting the 'file.encoding' JVM parameter, such a class is
> > unnecessary because it should be handled internally by Java. For Oracle/Sun
> > JDK and OpenJDK, setting the "file.encoding" parameter at run time will use
> > the charset you provided without actually changing any code.
>
> If Accumulo were only a pile of servers, you could do this. You could
> say that part of the configuration process for the servers is to
> specify the desired encoding to file.encoding, and your shell scripts
> could set UTF-8 by default.
>
> But Accumulo is *not* just a pile of servers. Setting file.encoding
> affects the entire JVM. A webapp that uses Accumulo now would need to
> have the entire servlet container have a particular setting of
> file.encoding. This just does not work in the wild. Even without the
> servlet container issue, a user of Accumulo may need to plug it into
> an existing code base that has other reasons to set file.encoding, and
> will not like it when Accumulo starts to corrupt his or her string
> data.
>
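
For reference, the ByteEncodingInitializer idea quoted above could look roughly like the sketch below. It is only an illustration under stated assumptions: the property name "example.byte.encoding" is hypothetical (nothing in this thread names one), and the UTF-8 fallback simply mirrors the practice that was described as already codified. Reading an application-specific property rather than file.encoding is one way to avoid touching the JVM-wide setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteEncodingInitializer {
    // Hypothetical, application-specific property name (an assumption for
    // this sketch); unlike file.encoding it does not affect the rest of the JVM.
    static final String PROPERTY = "example.byte.encoding";

    // Resolve a charset name, falling back to UTF-8 when none is given.
    // Charset.forName throws for malformed or unsupported names, so a bad
    // setting fails fast instead of silently corrupting data.
    static Charset resolve(String name) {
        if (name == null || name.isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(name);
    }

    // The single, globally used charset, fixed when the class is loaded.
    static final Charset CHARSET = resolve(System.getProperty(PROPERTY));

    private ByteEncodingInitializer() {
        // Holder class; not meant to be instantiated.
    }
}
```

Call sites would then use someString.getBytes(ByteEncodingInitializer.CHARSET) in place of each inline UTF-8 reference.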