Re: how to read binary data from hdfs
Harsh J 2012-05-01, 13:22
Yep, this can work, and it can be done with the same API I talked about
earlier: you just open one file and return it as a single record, the
same way Tom White's WholeFile implementation does.
How large would your files get though, since you'll be reading it all
into memory? Be careful about that part.
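To illustrate the memory point, here is a minimal sketch that reads fixed-size records one at a time instead of pulling the whole file into a single byte array. It uses only java.io and java.nio so it runs without a cluster. The record layout (a 4-byte int followed by an 8-byte double, little-endian), the class name FixedRecords and its methods are all assumptions made up for this example, not anything from your actual format. Since FSDataInputStream extends java.io.DataInputStream, the stream returned by fileSys.open(...) should be usable in place of the in-memory stream here.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout, assumed for illustration only:
// each record is a 4-byte int id followed by an 8-byte double value,
// written little-endian (typical for a C program on x86).
final class FixedRecords {
    static final int RECORD_SIZE = 4 + 8;

    static final class Rec {
        final int id;
        final double value;

        Rec(int id, double value) {
            this.id = id;
            this.value = value;
        }
    }

    // Reads exactly `count` records from the stream.
    // readFully either fills the whole buffer or throws EOFException,
    // whereas a plain read(buf, off, len) may return fewer bytes than asked.
    static List<Rec> readRecords(DataInputStream in, int count) throws IOException {
        byte[] buf = new byte[RECORD_SIZE];
        List<Rec> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            in.readFully(buf);
            // Java streams default to big-endian, so set the order explicitly.
            ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
            out.add(new Rec(bb.getInt(), bb.getDouble()));
        }
        return out;
    }

    // Builds two sample records in memory, standing in for a file on HDFS.
    static byte[] sample() {
        ByteBuffer bb = ByteBuffer.allocate(2 * RECORD_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(7).putDouble(1.5).putInt(8).putDouble(-2.25);
        return bb.array();
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(sample()));
        for (Rec r : readRecords(in, 2)) {
            System.out.println(r.id + " " + r.value);
        }
    }
}
```

Only one record-sized buffer is held at a time, so memory use does not grow with file size. If the C program ran on a big-endian machine, the ByteOrder would need to change accordingly.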
On Tue, May 1, 2012 at 6:45 PM, Amritanshu Shekhar
<[EMAIL PROTECTED]> wrote:
> Thanks for the input. Since my binary input file contains a fixed number of binary records in a fixed format, wouldn't it be simpler to use FSDataInputStream to read the binary data copied to HDFS as a byte array? I can simply copy a file containing HDFS paths to inputDir, and a map job would be invoked on each HDFS file. Example:
> FSDataInputStream stm = fileSys.open(filename, 4096);
> byte[] actual = new byte[...];
> stm.read(actual, 0, actual.length);
> Let me know if this approach would work and if a potentially better approach exists. I am new to Hadoop so my question might seem too simplistic for some people.
> -----Original Message-----
> From: Harsh J [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 01, 2012 6:21 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: how to read binary data from hdfs
> Implement your own custom InputFormat with a RecordReader and you can
> read your files directly.
> To learn how to implement custom readers/formats you can refer to an
> example provided via sub-title "Processing a whole file as a record",
> Page 206 | Chapter 7: MapReduce Types and Formats in Tom White's
> Hadoop: The Definitive Guide, or you can read up the details on
> On Tue, May 1, 2012 at 3:42 PM, Amritanshu Shekhar
> <[EMAIL PROTECTED]> wrote:
>> Hi Guys,
>> I want to read binary data (produced by a C program) that is copied to HDFS using a Java program. The idea is that I would eventually write a map-reduce job that uses the aforementioned program's output (the Java program would read the binary data and create a Java object which the map function would use). I read about the sequence file format that Hadoop supports, but converting the binary data into sequence file format using Java serialization would add another layer of complexity. Is there a simple, no-frills API that I can use to read binary data directly from HDFS? Any help/resources would be deeply appreciated.
>> Thanks and Regards,
> Harsh J