Using Hadoop infrastructure with input streams instead of key/value input
I am trying to use Hadoop's partitioning, scheduling, and storage infrastructure to process many HDFS files in parallel (one HDFS file per map task), but in a way that does not fit naturally into the key/value pair input framework. Specifically, my application's equivalent of the "map" function does not want to receive formatted data as key/value pairs. Instead, I'd like it to receive a Hadoop input stream object so that I can read bytes in many different ways, with much greater flexibility and efficiency than the key/value pair input constraint allows. The input stream would handle the complexity of fetching local and remote HDFS data blocks as needed on my behalf. The output of the map processing would still be conventional key/value pairs, to be processed by traditional reduce code.
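
To make the request concrete, here is a rough Java sketch of the kind of map signature I have in mind. Everything in it is hypothetical: `StreamMapSketch`, its `map` method, and the length-prefixed record layout are made up for illustration and are not part of Hadoop. The plain `java.io.InputStream` parameter stands in for whatever stream the framework would hand to the task; as far as I know, `FileSystem.open(Path)` returns an `FSDataInputStream`, which extends `java.io.DataInputStream`, so code shaped like this could in principle read from it.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a "map" step that is called once per file and is given
// a raw byte stream instead of pre-parsed key/value records.
final class StreamMapSketch {

    // Reads the whole stream however it likes and returns ordinary key/value
    // pairs, which a traditional reducer could then consume.
    // Assumed (made-up) file layout: repeated [4-byte big-endian length][payload].
    static List<Map.Entry<String, Long>> map(String fileName, InputStream in) {
        long records = 0;
        long payloadBytes = 0;
        try (DataInputStream data = new DataInputStream(in)) {
            while (true) {
                int length;
                try {
                    length = data.readInt();   // length prefix of the next record
                } catch (EOFException endOfFile) {
                    break;                     // no more records in this file
                }
                byte[] payload = new byte[length];
                data.readFully(payload);       // a real job would decode the payload here
                records++;
                payloadBytes += length;
            }
        } catch (IOException e) {
            // Wrapped as unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        out.add(new SimpleEntry<String, Long>(fileName + ":records", records));
        out.add(new SimpleEntry<String, Long>(fileName + ":bytes", payloadBytes));
        return out;
    }

    public static void main(String[] args) {
        // Two records with payload lengths 3 and 5, held in memory for the demo.
        byte[] sample = {0, 0, 0, 3, 1, 2, 3, 0, 0, 0, 5, 1, 2, 3, 4, 5};
        for (Map.Entry<String, Long> kv : map("clip", new ByteArrayInputStream(sample))) {
            System.out.println(kv.getKey() + "\t" + kv.getValue());
        }
    }
}
```

The point of the sketch is only the shape of the interface: the input side is an arbitrary byte stream, while the output side stays in the usual key/value form.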

I'm guessing that I am not the only person who would like to read HDFS file input directly, as this capability could open up new kinds of Hadoop usage models. Is there any support for acquiring input streams directly in Java map code? And is there any support for doing the same in C++ map code, à la Pipes?

For added context, my application is in the video analytics space and requires me to read video files. I have implemented a solution, but it is a hack with less-than-ideal characteristics: my RecordReader code simply passes the HDFS filename through in the key field of the key/value input. I'm using Pipes to implement the map function in C++. The C++ map code then makes a system call, "hadoop fs -copyToLocal hdfs_filename local_filename", to copy the entire HDFS file onto the datanode's local file system, where it is readable by C++ IO calls. I then simply open this local file and process it. It would be much better to avoid all the extra IO associated with "copyToLocal" and instead somehow receive an input stream object from which to read directly from HDFS.

Is there a way of doing this in a more elegant fashion?