Hadoop >> mail # user >> Re: Non utf-8 chars in input


Re: Non utf-8 chars in input
Rekha,

I guess the problem is that the Text class uses UTF-8 encoding and one cannot set another encoding for it.
I have not seen any other Text-like class that supports other encodings, so I have written my own custom input format class.
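
For anyone following the thread, here is a minimal sketch of the underlying encoding issue in plain Java, without any Hadoop classes. It assumes the input is really ISO-8859-1, which is only an example; DecodeDemo and decode are hypothetical names used for illustration. In a mapper the raw bytes would presumably come from Text.getBytes() and Text.getLength(), on the assumption that the record reader has kept them undecoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decode the first `length` bytes of `bytes` using the given charset.
    // Only `length` bytes are used because a reusable buffer (such as the
    // array behind a Hadoop Text) may be longer than the valid data.
    static String decode(byte[] bytes, int length, Charset charset) {
        return new String(bytes, 0, length, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1 (an assumed example encoding).
        // The lone byte 0xE9 is not a valid UTF-8 sequence.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding as UTF-8 substitutes U+FFFD for the malformed byte.
        String asUtf8 = decode(raw, raw.length, StandardCharsets.UTF_8);

        // Decoding with the charset the data was written in keeps the character.
        String asLatin1 = decode(raw, raw.length, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode contains U+FFFD: " + (asUtf8.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 decode keeps e-acute: " + asLatin1.equals("caf\u00E9"));
    }
}
```

The point of the sketch is that the replacement character appears at decode time, so choosing the charset when converting bytes to a String is what matters, whichever input format supplies the bytes.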

Thanks for your inputs.
Regards,
Ajay Srivastava
On 11-Sep-2012, at 1:31 PM, Joshi, Rekha wrote:

> Actually even if that works, it does not seem an ideal solution.
>
> I think format and encoding are distinct, and enforcing a format must not
> enforce an encoding. That means there should be a way to pass the
> encoding as a user choice on construction,
> e.g. TextInputFormat("your-encoding").
> But I do not see that in the API, so even if I extend
> InputFormat/RecordReader, I will not be able to have a
> setEncoding() feature on my file format. Having that would be a good solution.
>
> Thanks
> Rekha
>
> On 11/09/12 12:37 PM, "Joshi, Rekha" <[EMAIL PROTECTED]> wrote:
>
>> Hi Ajay,
>>
>> Try SequenceFileAsBinaryInputFormat ?
>>
>>
>> Thanks
>> Rekha
>>
>> On 11/09/12 11:24 AM, "Ajay Srivastava" <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am using the default InputFormat class to read input from text files,
>>> but the input file has some non-UTF-8 characters.
>>> I guess that TextInputFormat is the default InputFormat class and that it
>>> replaces these non-UTF-8 characters with "\uFFFD". If I do not want this
>>> behavior and need the actual characters in my mapper, what would be the
>>> correct InputFormat class?
>>>
>>>
>>>
>>> Regards,
>>> Ajay Srivastava
>>
>