Re: Using Hadoop infrastructure with input streams instead of key/value input
Hemanth Yamijala 2012-12-04, 08:36
Hi,
I have not tried this myself, but would libhdfs help?
http://hadoop.apache.org/docs/stable/libhdfs.html

Thanks
Hemanth

On Mon, Dec 3, 2012 at 9:52 PM, Wheeler, Bill NPO <[EMAIL PROTECTED]> wrote:

> I am trying to use Hadoop's partitioning/scheduling/storage
> infrastructure to process many HDFS files of data in parallel (1 HDFS
> file per map task), but in a way that does not naturally fit into the
> key/value pair input framework. Specifically, my application's "map"
> function equivalent does not want to receive formatted data as
> key/value pairs; instead, I'd like to receive a Hadoop input stream
> object for my map processing so that I can read bytes out in many
> different ways, with much greater flexibility and efficiency than what
> I'd get with the key/value pair input constraint. The input stream
> would handle the complexity of fetching local and remote HDFS data
> blocks as needed on my behalf. The result of the map processing would
> then conform to key/value pair map outputs and be subsequently
> processed by traditional reduce code.
>
> I'm guessing that I am not the only person who would like to read HDFS
> file input directly, as this capability could open up a new type of
> Hadoop use model. Is there any support for acquiring input streams
> directly in Java map code? And is there any support for doing the same
> in C++ map code, a la Pipes?
>
> For added context, my application is in the video analytics space,
> requiring me to read video files. I have implemented a solution, but it
> is a hack with less than ideal characteristics: I have RecordReader
> code which simply passes the HDFS filename through in the key field of
> my key/value input. I'm using Pipes to implement the map function in
> C++ code. The C++ map code then performs a system call,
> "hadoop fs -copyToLocal hdfs_filename local_filename", to put the
> entire HDFS file on the datanode's local file system, where it is
> readable by C++ IO calls. I then simply open up this file and process
> it. It would be much better to avoid all the extra IO associated with
> "copyToLocal" and instead somehow receive an input stream object from
> which to read directly from HDFS.
>
> Any way of doing this in a more elegant fashion?
>
> Thanks,
> Bill
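
For illustration only, here is a rough sketch of what the libhdfs route suggested in the reply might look like from C++ map code. It is untested against a real cluster. The names `ReadFn`, `StreamStats`, `consume_stream`, `consume_hdfs_file`, `make_memory_reader`, and the `USE_LIBHDFS` macro are hypothetical and not from the thread. The byte counting and checksum stand in for the real video processing. The libhdfs calls (`hdfsConnect`, `hdfsOpenFile`, `hdfsRead`, `hdfsCloseFile`, `hdfsDisconnect`) are compiled only when `USE_LIBHDFS` is defined, so the read loop can be tried without Hadoop installed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Reads up to `len` bytes into `buf`.
// Returns the number of bytes read, 0 at end of stream, negative on error.
using ReadFn = std::function<long(void* buf, std::size_t len)>;

// Hypothetical stand-in for the map-side processing result.
struct StreamStats {
    std::uint64_t bytes = 0;     // total bytes consumed
    std::uint32_t checksum = 0;  // simple rolling checksum over the bytes
    bool ok = true;              // false if the reader reported an error
};

// Pulls bytes from `read` in chunks until end of stream or error.
StreamStats consume_stream(const ReadFn& read, std::size_t chunk_size = 64 * 1024) {
    assert(chunk_size > 0);
    std::vector<unsigned char> buf(chunk_size);
    StreamStats s;
    for (;;) {
        long n = read(buf.data(), buf.size());
        if (n < 0) { s.ok = false; break; }  // read error
        if (n == 0) break;                   // end of stream
        for (long i = 0; i < n; ++i) s.checksum = s.checksum * 31u + buf[i];
        s.bytes += static_cast<std::uint64_t>(n);
    }
    return s;
}

// In-memory reader, useful for exercising the loop without HDFS.
ReadFn make_memory_reader(std::string data) {
    auto pos = std::make_shared<std::size_t>(0);
    return [data, pos](void* buf, std::size_t len) -> long {
        std::size_t n = std::min(len, data.size() - *pos);
        std::memcpy(buf, data.data() + *pos, n);
        *pos += n;
        return static_cast<long>(n);
    };
}

#ifdef USE_LIBHDFS
#include <fcntl.h>  // O_RDONLY
#include "hdfs.h"   // libhdfs C API

// Streams an HDFS file directly, instead of copying it to local disk first.
// Assumption: `path` is the HDFS filename that arrives in the map key.
StreamStats consume_hdfs_file(const std::string& path) {
    StreamStats failed;
    failed.ok = false;
    hdfsFS fs = hdfsConnect("default", 0);  // filesystem from the Hadoop config
    if (!fs) return failed;
    hdfsFile in = hdfsOpenFile(fs, path.c_str(), O_RDONLY, 0, 0, 0);
    if (!in) { hdfsDisconnect(fs); return failed; }
    StreamStats s = consume_stream([&](void* buf, std::size_t len) -> long {
        return static_cast<long>(hdfsRead(fs, in, buf, static_cast<tSize>(len)));
    });
    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return s;
}
#endif
```

With libhdfs available, this would presumably be built with `-DUSE_LIBHDFS` and linked against the libhdfs library, and `consume_hdfs_file` would be called from the Pipes map function in place of the `copyToLocal` system call. Whether the task environment has the classpath and native libraries libhdfs needs is something to verify on the cluster.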