Accumulo >> mail # dev >> Shell Charset?


Re: Shell Charset?
On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[EMAIL PROTECTED]> wrote:

> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
> getEndRow use the following snippet to read their arguments:
>
> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>
> Here, Shell.CHARSET is set to ISO-8859-1
>
> This seems to mean that if I use unescaped non-Latin-1 characters (for
> example, UTF-8 input) in the shell to set my begin or end row, I will not
> get what I expect, because the conversion from String to bytes is performed
> with the wrong character set.
>
> For example, in the following snippet, testISO fails while testUTF succeeds
> (when the encoding of the source file is UTF-8):
>
>   @Test
>   public void testISO() throws Exception {
>     String s = "本条目是介紹";
>     String charset = "ISO-8859-1";
>     Text t = new Text(s.getBytes(charset));
>     Assert.assertEquals(s, t.toString());
>   }
>
>   @Test
>   public void testUTF() throws Exception {
>     String s = "本条目是介紹";
>     String charset = "UTF-8";
>     Text t = new Text(s.getBytes(charset));
>     Assert.assertEquals(s, t.toString());
>   }
>
> Possibly this should be locale dependent behavior? Also, perhaps I'm
> missing the fact that the Shell is not supposed to support UTF-8 characters
> in start and end ranges, and users must escape their strings appropriately.
> (Which would be a bit of a pain).
>

I think the way the shell is written, it pushes binary data (that may not
be valid UTF-8) through Strings. At some point the \xNN escape codes are
converted to binary, and that data is pushed back into a String.
ISO-8859-1 ensures this works, since it maps every byte value to exactly one
character and back. Ideally the shell would not do this.
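
To make the trade-off concrete, here is a minimal standalone sketch. It is not
the shell's actual code; the class and method names (CharsetRoundTrip,
allByteValues, bytesSurvive, throughBytes) are made up for illustration, and it
uses only the JDK, not Hadoop's Text. It shows that arbitrary bytes survive a
trip through a String under ISO-8859-1 but not under UTF-8, and that the sample
string from the quoted test survives UTF-8 but not ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Accumulo.
public class CharsetRoundTrip {

    /** Every possible byte value, 0x00 through 0xFF. */
    static byte[] allByteValues() {
        byte[] raw = new byte[256];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) i;
        }
        return raw;
    }

    /** True if bytes -> String -> bytes under cs gives back the same bytes. */
    static boolean bytesSurvive(byte[] raw, Charset cs) {
        return Arrays.equals(raw, new String(raw, cs).getBytes(cs));
    }

    /** Result of String -> bytes -> String under cs. */
    static String throughBytes(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        byte[] raw = allByteValues();

        // Binary data pushed through a String: lossless only with ISO-8859-1.
        System.out.println("bytes survive ISO-8859-1: "
                + bytesSurvive(raw, StandardCharsets.ISO_8859_1));
        System.out.println("bytes survive UTF-8:      "
                + bytesSurvive(raw, StandardCharsets.UTF_8));

        // The string from the quoted test, written as Unicode escapes so the
        // result does not depend on the source file's encoding.
        String s = "\u672C\u6761\u76EE\u662F\u4ECB\u7D39";

        // Text typed at the shell: lossless with UTF-8, while ISO-8859-1
        // replaces each unmappable character with '?'.
        System.out.println("text survives ISO-8859-1: "
                + throughBytes(s, StandardCharsets.ISO_8859_1).equals(s));
        System.out.println("text survives UTF-8:      "
                + throughBytes(s, StandardCharsets.UTF_8).equals(s));
    }
}
```

So the two uses pull in opposite directions: ISO-8859-1 is the safe choice for
carrying raw bytes inside a String, and for exactly that reason it cannot
represent characters outside Latin-1.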
>
>
> - Drew
>