Accumulo, mail # dev - Shell Charset?


Shell Charset?
Drew Farris 2013-05-05, 22:49
In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
getEndRow use the following snippet to read their arguments:

new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));

Here, Shell.CHARSET is set to ISO-8859-1.

This seems to mean that if I type unescaped non-ASCII characters in the
shell to set my start or end row, I will not get what I expect, because
the conversion from String to bytes is performed with ISO-8859-1 while
Text interprets the bytes it is given as UTF-8.
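The effect can be reproduced with the JDK alone. The sketch below is illustrative only: CharsetRoundTrip and encodeThenDecodeUtf8 are made-up names, and decoding with UTF-8 stands in for what Text does with its bytes. The sample string is written with Unicode escapes so the result does not depend on the source-file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

  // Encode the string with the given charset, then decode the bytes as
  // UTF-8. The UTF-8 decode is a stand-in for how Text reads its bytes.
  static String encodeThenDecodeUtf8(String s, Charset charset) {
    byte[] bytes = s.getBytes(charset);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Six CJK characters, none of which exist in ISO-8859-1.
    String row = "\u672c\u6761\u76ee\u662f\u4ecb\u7d39";

    // ISO-8859-1 cannot map these characters, so getBytes substitutes
    // a '?' byte for each one and the original row is lost.
    String viaIso = encodeThenDecodeUtf8(row, StandardCharsets.ISO_8859_1);
    System.out.println(row.equals(viaIso)); // false

    // UTF-8 can represent every character, so the round trip is lossless.
    String viaUtf8 = encodeThenDecodeUtf8(row, StandardCharsets.UTF_8);
    System.out.println(row.equals(viaUtf8)); // true
  }
}
```

Plain ASCII rows are unaffected either way, since ISO-8859-1 and UTF-8 agree on those bytes.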

For example, in the following snippet, testISO fails while testUTF succeeds
(when the encoding of the source file is UTF-8):
  @Test
  public void testISO() throws Exception {
    String s = "本条目是介紹";
    String charset = "ISO-8859-1";
    Text t = new Text(s.getBytes(charset));
    Assert.assertEquals(s, t.toString());
  }

  @Test
  public void testUTF() throws Exception {
    String s = "本条目是介紹";
    String charset = "UTF-8";
    Text t = new Text(s.getBytes(charset));
    Assert.assertEquals(s, t.toString());
  }

Possibly this should be locale-dependent behavior? Also, perhaps I'm
missing the fact that the Shell is not supposed to support UTF-8 characters
in start and end ranges, and users must escape their strings appropriately.
(Which would be a bit of a pain).
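
To show what escaping by hand would involve, here is a rough sketch. It assumes the shell reads \xNN as a single raw byte, which I have not verified against the shell's tokenizer. RowEscaper and toHexEscapes are hypothetical names, not part of Accumulo.

```java
import java.nio.charset.StandardCharsets;

class RowEscaper {

  // Hypothetical helper: keep printable ASCII as-is and write every other
  // UTF-8 byte (and the backslash itself) as a \xNN hex escape.
  static String toHexEscapes(String row) {
    StringBuilder sb = new StringBuilder();
    for (byte b : row.getBytes(StandardCharsets.UTF_8)) {
      int v = b & 0xff; // treat the byte as unsigned
      if (v >= 0x20 && v < 0x7f && v != '\\') {
        sb.append((char) v);
      } else {
        sb.append(String.format("\\x%02x", v));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // U+672C encodes to the three UTF-8 bytes e6 9c ac.
    System.out.println(toHexEscapes("a\u672c")); // a\xe6\x9c\xac
  }
}
```

Each CJK character becomes three escapes, and quotes or spaces would still need the shell's own quoting, so this gets unwieldy quickly for real row keys.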
- Drew