Accumulo, mail # dev - Shell Charset?


Re: Shell Charset?
John Vines 2013-05-06, 14:55
Sounds like we should grep through the codebase and make sure the only
charset we're using is UTF-8...
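
For anyone who wants to see the difference without pulling in Hadoop, here is a minimal JDK-only sketch. It assumes, as Hadoop's Text does, that the stored bytes are later decoded as UTF-8. The class and method names (CharsetRoundTrip, decodeAsUtf8) are made up for illustration and are not part of Accumulo. The sample string is the one from Drew's test below, written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Hypothetical helper: encode with the given charset, then decode the
    // bytes as UTF-8, which mimics what happens when the bytes end up in a
    // container that always interprets its contents as UTF-8.
    static String decodeAsUtf8(String s, Charset encodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same characters as in the quoted test, as escapes.
        String s = "\u672c\u6761\u76ee\u662f\u4ecb\u7d39";

        // UTF-8 can represent every character, so the string survives.
        System.out.println("UTF-8 round trip intact: "
            + s.equals(decodeAsUtf8(s, StandardCharsets.UTF_8)));

        // ISO-8859-1 has no mapping for these characters; getBytes
        // substitutes the charset's replacement byte, so the data is lost.
        System.out.println("ISO-8859-1 round trip intact: "
            + s.equals(decodeAsUtf8(s, StandardCharsets.ISO_8859_1)));
    }
}
```

The point of the sketch is that the loss happens at encode time: once getBytes has substituted replacement bytes, no later decoding step can recover the original row.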
On Sun, May 5, 2013 at 8:08 PM, Christopher <[EMAIL PROTECTED]> wrote:

> The shell should accept a Java String from the console (leaving
> the job of converting input bytes to a Java String argument to the
> locale-dependent console), and should only translate it to UTF-8
> when it sends it to Accumulo, I think.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[EMAIL PROTECTED]> wrote:
> > In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
> > getEndRow use the following snippet to read their arguments:
> >
> > new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
> >
> > Here, Shell.CHARSET is set to ISO-8859-1
> >
> > This seems to mean that if I use unescaped UTF-8 characters from the
> > shell to set my start or end row, I will not get what I expect, because
> > the conversion from String to bytes would be performed using the
> > incorrect character set.
> >
> > For example, in the following snippet, testISO fails while testUTF
> > succeeds (when the encoding of the source file is UTF-8):
> >
> >
> >   @Test
> >   public void testISO() throws Exception {
> >     String s = "本条目是介紹";
> >     String charset = "ISO-8859-1";
> >     // ISO-8859-1 cannot represent these characters, so the bytes are lossy
> >     Text t = new Text(s.getBytes(charset));
> >     Assert.assertEquals(s, t.toString());
> >   }
> >
> >   @Test
> >   public void testUTF() throws Exception {
> >     String s = "本条目是介紹";
> >     String charset = "UTF-8";
> >     // Text decodes its bytes as UTF-8, so this round trip is lossless
> >     Text t = new Text(s.getBytes(charset));
> >     Assert.assertEquals(s, t.toString());
> >   }
> >
> >
> > Possibly this should be locale-dependent behavior? Also, perhaps I'm
> > missing the fact that the Shell is not supposed to support UTF-8
> > characters in start and end ranges, and users must escape their strings
> > appropriately. (Which would be a bit of a pain.)
> >
> >
> > - Drew
>
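
Christopher's suggestion above (keep the argument as a Java String and encode to UTF-8 only at the point where it is handed to Accumulo) could look roughly like the sketch below. This is not the actual OptUtil code; RowArgs and toRowBytes are hypothetical names, and returning null for a missing option is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

public class RowArgs {
    // Hypothetical helper: the console has already turned the user's input
    // into a Java String, so the only encoding decision left is the one
    // made here, and it is always UTF-8.
    static byte[] toRowBytes(String optionValue) {
        if (optionValue == null) {
            return null; // assumption: option was not supplied
        }
        return optionValue.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Use the first command-line argument if present, else a sample
        // value containing one non-ASCII character (e with acute accent).
        String row = args.length > 0 ? args[0] : "caf\u00e9";
        byte[] utf8 = toRowBytes(row);
        System.out.println(row + " -> " + utf8.length + " UTF-8 bytes");
    }
}
```

With this shape, the locale only matters for how the console builds the String; the bytes stored as the row no longer depend on Shell.CHARSET.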