Accumulo, mail # dev - Setting Charset in getBytes() call.


Re: Setting Charset in getBytes() call.
Adam Fuchs 2012-10-30, 20:26
On Mon, Oct 29, 2012 at 9:22 PM, Drew Farris <[EMAIL PROTECTED]> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g., stuffing a
> non-UTF-8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to eliminate these if they currently exist.
>

The Text class is essentially a wrapper around a byte[], with some
convenience methods for translating to/from other types. Accumulo only ever
reads bytes out of it, so there is no encoding problem there. We also don't
use most of its convenience methods. Many people see that it is named
"Text" and assume that it only stores human-readable text, but that is not
the case. It probably should have been named
"ConvenientByteArrayWrapperWithSomeMemoryEfficiencySupportAndStringOrientedTranslationMethodsThatIsWritableComparable".

I also agree that it would be good to get rid of the reliance on Hadoop's
Text object, especially because people often do not respect getLength() on
the byte[] obtained from getBytes().
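
To illustrate the getLength() pitfall, here is a minimal sketch. It does not
use Hadoop's Text itself: ReusableBytes is a hypothetical stand-in that only
assumes the behavior described above (a backing byte[] that can be longer than
the valid data, plus a separate length). The class and method names other than
getBytes() and getLength() are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextLengthSketch {

    /** Hypothetical stand-in for a reusable byte wrapper; NOT Hadoop's Text. */
    static final class ReusableBytes {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Copies src in, growing the backing array only when it is too small. */
        void set(byte[] src) {
            if (src.length > bytes.length) {
                bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        /** Returns the raw backing array, which may hold stale trailing bytes. */
        byte[] getBytes() {
            return bytes;
        }

        /** Returns the number of valid bytes at the start of the backing array. */
        int getLength() {
            return length;
        }
    }

    /** Copies only the valid prefix of the backing array. */
    static byte[] copyValid(ReusableBytes b) {
        return Arrays.copyOf(b.getBytes(), b.getLength());
    }

    public static void main(String[] args) {
        ReusableBytes b = new ReusableBytes();
        b.set("accumulo".getBytes(StandardCharsets.UTF_8));
        b.set("row".getBytes(StandardCharsets.UTF_8)); // reuses the 8-byte array

        // Ignoring getLength() picks up leftovers from the previous value.
        String wrong = new String(b.getBytes(), StandardCharsets.UTF_8);
        // Respecting getLength() yields only the current value.
        String right = new String(copyValid(b), StandardCharsets.UTF_8);

        System.out.println(wrong); // prints "rowumulo"
        System.out.println(right); // prints "row"
    }
}
```

The same rule applies to any consumer of such a wrapper: treat only the first
getLength() bytes of the array as data.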

Adam