Joshi, Rekha 2012-09-11, 08:01
Re: Non utf-8 chars in input
Ajay Srivastava 2012-09-11, 08:21
Rekha,
I think the problem is that the Text class always uses UTF-8 encoding, and no other encoding can be set for it. I have not seen any other Text-like class that supports a different encoding; otherwise I would have written my own custom input format class.

Thanks for your inputs.

Regards,
Ajay Srivastava

On 11-Sep-2012, at 1:31 PM, Joshi, Rekha wrote:

> Actually even if that works, it does not seem an ideal solution.
>
> I think format and encoding are distinct, and enforcing a format must not
> enforce an encoding. So that means there must be a possibility to pass
> the encoding as a user choice on construction,
> e.g.: TextInputFormat("your-encoding").
> But I do not see that in the API, so even if I extend
> InputFormat/RecordReader, I will not be able to have a feature like
> setEncoding() on my file format. Having that would be a good solution.
>
> Thanks
> Rekha
>
> On 11/09/12 12:37 PM, "Joshi, Rekha" <[EMAIL PROTECTED]> wrote:
>
>> Hi Ajay,
>>
>> Try SequenceFileAsBinaryInputFormat?
>>
>> Thanks
>> Rekha
>>
>> On 11/09/12 11:24 AM, "Ajay Srivastava" <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am using the default inputFormat class for reading input from text
>>> files, but the input file has some non utf-8 characters.
>>> I guess that TextInputFormat is the default inputFormat class and that
>>> it replaces these non utf-8 chars with "\uFFFD". If I do not want this
>>> behavior and need the actual char in my mapper, what would be the
>>> correct inputFormat class?
>>>
>>> Regards,
>>> Ajay Srivastava
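
For reference, here is a minimal sketch of one possible workaround, not something proposed in the thread itself. It assumes the input is in a single-byte, ASCII-compatible encoding (ISO-8859-1 is used purely as an example) and that the record reader hands the mapper the raw line bytes unchanged in the Text buffer, so that the "\uFFFD" replacement only appears when those bytes are decoded as UTF-8. The class name DecodeDemo and the helper decode() are hypothetical; the example uses only the JDK so it can run without Hadoop.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Decodes the first `length` bytes of `buf` with the given charset.
    // Hypothetical helper, not part of any Hadoop API.
    static String decode(byte[] buf, int length, Charset charset) {
        return new String(buf, 0, length, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last byte, 0xE9, is not valid UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding as UTF-8 substitutes U+FFFD for the malformed byte.
        String asUtf8 = decode(raw, raw.length, StandardCharsets.UTF_8);

        // Decoding with the encoding the file was written in keeps the character.
        String asLatin1 = decode(raw, raw.length, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode contains U+FFFD: " + (asUtf8.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 decode keeps the char: " + asLatin1.equals("caf\u00E9"));
    }
}
```

Inside a mapper, the same idea would presumably be decode(value.getBytes(), value.getLength(), charset) instead of value.toString(). The explicit length matters because the array returned by Text.getBytes() can be longer than the valid data. This is unlikely to work for encodings such as UTF-16, where the line splitting itself would go wrong.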