-
Using Hadoop infrastructure with input streams instead of key/value input
Wheeler, Bill NPO 2012-12-03, 16:22
I am trying to use Hadoop's partitioning/scheduling/storage infrastructure to process many HDFS files of data in parallel (1 HDFS file per map task), but in a way that does not naturally fit into the key/value pair input framework. Specifically, my application's "map" function equivalent does not want to receive formatted data as key/value pairs. Instead, I'd like to receive a Hadoop input stream object for my map processing so that I can read bytes out in many different ways, with much greater flexibility and efficiency than I'd get under the key/value pair input constraint. The input stream would handle the complexity of fetching local and remote HDFS data blocks as needed on my behalf. The result of the map processing would then conform to key/value pair map outputs and be subsequently processed by traditional reduce code.
I'm guessing that I am not the only person who would like to read HDFS file input directly, as this capability could open up new types of Hadoop usage models. Is there any support for acquiring input streams directly in Java map code? And is there any support for doing the same in C++ map code à la Pipes?
For added context, my application is in the video analytics space, requiring me to read video files. I have implemented a solution, but it is a hack with less than ideal characteristics: I have RecordReader code which simply passes the HDFS filename through in the key field of my key/value input. I'm using Pipes to implement the map function in C++ code. The C++ map code then performs a system call, "hadoop fs -copyToLocal hdfs_filename local_filename", to put the entire HDFS file on the datanode's local file system, where it is readable by C++ IO calls. I then simply open this file and process it. It would be much better to avoid all the extra IO associated with "copyToLocal" and instead somehow receive an input stream object from which to read directly from HDFS.
Any way of doing this in a more elegant fashion?
Thanks, Bill
-
Re: Using Hadoop infrastructure with input streams instead of key/value input
Steve Lewis 2012-12-03, 17:06
I presume a single file is handled by one and only one mapper. In that case you can pass the path as a string and do something like this:
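import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

// Inside a Mapper subclass: the input value is the HDFS path to process.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String hdfsPath = value.toString();
    final FileSystem fs = FileSystem.get(context.getConfiguration());
    Path src = new Path(hdfsPath);
    InputStream is = null;
    try {
        is = fs.open(src);
        // ... handle Stream
    } finally {
        if (is != null) is.close();
    }
}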
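As a rough illustration of what the "handle Stream" step might look like, here is a minimal sketch that reads length-prefixed records from a stream. The framing (a 4-byte big-endian length followed by the payload) is invented purely for illustration and is not a real video container format. The names `StreamRecords`, `RecordHandler`, and `forEachRecord` are hypothetical too. It uses only `java.io`, so it can be tried without a cluster.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical helper, not part of Hadoop: walks length-prefixed records in a stream. */
final class StreamRecords {

    /** Callback invoked once per record payload. */
    interface RecordHandler {
        void handle(byte[] payload) throws IOException;
    }

    /**
     * Reads records framed as [4-byte big-endian length][payload] until the
     * stream ends, passing each payload to the handler.
     * Assumption of this sketch: hitting end of stream while reading a length
     * header (even a partial one) is treated as normal end of input.
     *
     * @return the number of records handled
     */
    static int forEachRecord(DataInputStream in, RecordHandler handler) throws IOException {
        int count = 0;
        while (true) {
            int length;
            try {
                length = in.readInt();
            } catch (EOFException endOfInput) {
                return count; // no further record header available
            }
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            in.readFully(payload); // throws EOFException if the payload is truncated
            handler.handle(payload);
            count++;
        }
    }
}
```

In the mapper above, `fs.open(src)` returns an `FSDataInputStream`, which extends `java.io.DataInputStream`, so it could in principle be passed to a method like `forEachRecord` directly. Only one record is held in memory at a time, and nothing is copied to local disk first.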
You might try streaming to a C program.
-- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com