Accumulo >> mail # dev >> Shell Charset?


Re: Shell Charset?
Would a better long-term solution be to just deal with it in a new shell
that actually supports all sorts of constructs outside of the current
shell commands?

I'm thinking of Python where you have the ability to specify things like
u'\u0000'. The proxy would certainly lower the barrier to doing something
like this.

Would that be overkill to work towards in 1.6? Does this merit fixing
sooner?

On 5/6/13 2:09 PM, Keith Turner wrote:
> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[EMAIL PROTECTED]> wrote:
>
>> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
>> getEndRow use the following snippet to read their arguments:
>>
>> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>>
>> Here, Shell.CHARSET is set to ISO-8859-1
>>
>> This seems to mean that if I use UTF-8 characters (unescaped) from the
>> shell to set my begin or end row, I will not get what I expect, because
>> the conversion from String to bytes is performed using the wrong
>> character set.
>>
>> For example, in the following snippet, testISO fails while testUTF succeeds
>> (when the encoding of the source file is UTF-8):
>>
>>    @Test
>>    public void testISO() throws Exception {
>>      String s = "本条目是介紹";
>>      String charset = "ISO-8859-1";
>>      Text t = new Text(s.getBytes(charset));
>>      Assert.assertEquals(s, t.toString());
>>    }
>>
>>    @Test
>>    public void testUTF() throws Exception {
>>      String s = "本条目是介紹";
>>      String charset = "UTF-8";
>>      Text t = new Text(s.getBytes(charset));
>>      Assert.assertEquals(s, t.toString());
>>    }
>>
>> Possibly this should be locale-dependent behavior? Also, perhaps I'm
>> missing the fact that the Shell is not supposed to support UTF-8 characters
>> in start and end ranges, and users must escape their strings appropriately
>> (which would be a bit of a pain).
>>
> I think the way the shell is written, it pushes binary data (that may not
> be UTF-8) through strings. I think at some point the \xNN escape codes are
> converted to binary and this data is pushed back into a String.
> ISO-8859-1 ensures this works. Ideally the shell would not do this.
>
>
>>
>> - Drew
>>
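
To make the trade-off discussed above concrete, here is a minimal sketch in plain Java (no Accumulo or Hadoop classes) of the two round trips in question. The class and method names (CharsetRoundTrip, bytesSurvive, textSurvives) are made up for illustration and are not part of the shell. It only shows that ISO-8859-1 preserves arbitrary bytes pushed through a String, which is what Keith describes, while it cannot represent the characters from Drew's test; it does not claim to reproduce what the shell actually does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Accumulo.
public class CharsetRoundTrip {

  // bytes -> String -> bytes: true if the original bytes come back unchanged.
  static boolean bytesSurvive(byte[] data, Charset cs) {
    return Arrays.equals(data, new String(data, cs).getBytes(cs));
  }

  // String -> bytes -> String: true if the original text comes back unchanged.
  static boolean textSurvives(String s, Charset cs) {
    return s.equals(new String(s.getBytes(cs), cs));
  }

  public static void main(String[] args) {
    // Arbitrary binary data that is not valid UTF-8 (for example, the result
    // of expanding \xNN escapes in a row key).
    byte[] binary = {(byte) 0x00, (byte) 0x80, (byte) 0xC3, (byte) 0xFF};

    // The string from the quoted test, written with \u escapes so the result
    // does not depend on the encoding of this source file.
    String text = "\u672C\u6761\u76EE\u662F\u4ECB\u7D39";

    System.out.println("binary via ISO-8859-1: " + bytesSurvive(binary, StandardCharsets.ISO_8859_1));
    System.out.println("binary via UTF-8:      " + bytesSurvive(binary, StandardCharsets.UTF_8));
    System.out.println("text via ISO-8859-1:   " + textSurvives(text, StandardCharsets.ISO_8859_1));
    System.out.println("text via UTF-8:        " + textSurvives(text, StandardCharsets.UTF_8));
  }
}
```

ISO-8859-1 maps each byte value 0-255 to the code point with the same number, so the binary round trip is lossless, but String.getBytes substitutes a replacement byte for characters the charset cannot encode, so the text round trip fails. UTF-8 behaves the other way around: the text survives, while malformed byte sequences are replaced during decoding and the binary data is lost.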