Accumulo, mail # dev - Shell Charset?


Re: Shell Charset?
Keith Turner 2013-05-06, 19:42
On Mon, May 6, 2013 at 2:49 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> Would a better long-term solution be to just deal with it in a new shell
> that actually supports all sorts of constructs outside of the current shell
> commands?
>
> I'm thinking of Python where you have the ability to specify things like
> u'\0000'. The proxy would certainly drop the barrier of doing something
> like this.
>
> Would that be overkill to work towards in 1.6? Does this merit fixing
> sooner?
There is ACCUMULO-1045.
>
>
> On 5/6/13 2:09 PM, Keith Turner wrote:
>
>> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[EMAIL PROTECTED]> wrote:
>>
>>> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
>>> getEndRow use the following snippet to read their arguments:
>>>
>>> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>>>
>>> Here, Shell.CHARSET is set to ISO-8859-1.
>>>
>>> This seems to mean that if I use unescaped UTF-8 characters from the
>>> shell to set my begin or end row, I will not get what I expect, because
>>> the conversion from String to bytes is performed using the incorrect
>>> character set.
>>>
>>> For example, in the following snippet, testISO fails while testUTF
>>> succeeds (when the encoding of the source file is UTF-8):
>>>
>>>
>>>    @Test
>>>    public void testISO() throws Exception {
>>>      String s = "本条目是介紹";
>>>      String charset = "ISO-8859-1";
>>>      Text t = new Text(s.getBytes(charset));
>>>      Assert.assertEquals(s, t.toString());
>>>    }
>>>
>>>    @Test
>>>    public void testUTF() throws Exception {
>>>      String s = "本条目是介紹";
>>>      String charset = "UTF-8";
>>>      Text t = new Text(s.getBytes(charset));
>>>      Assert.assertEquals(s, t.toString());
>>>    }
>>>
>>>
>>> Possibly this should be locale-dependent behavior? Also, perhaps I'm
>>> missing the fact that the Shell is not supposed to support UTF-8
>>> characters in start and end ranges, and users must escape their strings
>>> appropriately (which would be a bit of a pain).
>>>
>> I think the way the shell is written, it pushes binary data (that may not
>> be UTF-8) through strings. I think at some point the \xNN escape codes are
>> converted to binary and this data is pushed back into a String.
>> ISO-8859-1 ensures this works. Ideally the shell would not do this.
>>
>>
>>
>>> - Drew
>>>
>>>
>
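
The two behaviors discussed above can be reproduced with a minimal, self-contained sketch. This is not code from Accumulo. It assumes that Hadoop's Text.toString() decodes its bytes as UTF-8, and emulates that with new String(bytes, UTF_8) so the example needs no Hadoop dependency. The class and method names (CharsetRoundTrip, bytesSurvive, textSurvives, allByteValues) are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names here are hypothetical, not Accumulo APIs.
class CharsetRoundTrip {

    // Every possible byte value, standing in for arbitrary binary row data
    // such as the result of expanding \xNN escapes.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Push bytes through a String and back using the same charset.
    // Returns true only if every byte comes back unchanged.
    static boolean bytesSurvive(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    // Encode text with the given charset, then decode the bytes as UTF-8
    // (assumed to match what Text.toString() does).
    // Returns true only if the original text comes back unchanged.
    static boolean textSurvives(String s, Charset encodeWith) {
        byte[] encoded = s.getBytes(encodeWith);
        return s.equals(new String(encoded, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 maps each byte to exactly one char, so binary data survives.
        System.out.println(bytesSurvive(all, StandardCharsets.ISO_8859_1)); // true
        // Bytes that are not valid UTF-8 are replaced while decoding.
        System.out.println(bytesSurvive(all, StandardCharsets.UTF_8));      // false

        // The string from the quoted tests, written with Unicode escapes so the
        // result does not depend on the source file encoding.
        String s = "\u672c\u6761\u76ee\u662f\u4ecb\u7d39";
        // ISO-8859-1 cannot represent these characters; they are encoded as '?'.
        System.out.println(textSurvives(s, StandardCharsets.ISO_8859_1));   // false
        System.out.println(textSurvives(s, StandardCharsets.UTF_8));        // true
    }
}
```

Under these assumptions, the first pair of checks shows why ISO-8859-1 is a safe carrier when arbitrary bytes are pushed through a String, and the second pair shows why the same choice breaks unescaped non-Latin row values typed at the shell.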