Re: Using Hadoop infrastructure with input streams instead of key/value input
Steve Lewis 2012-12-03, 17:06
I presume a single file is handled by one and only one mapper. In that case
you can pass the path as a string and do something like this:

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input value carries the HDFS path of the file to process.
        String hdfsPath = value.toString();
        final FileSystem fs = FileSystem.get(context.getConfiguration());
        Path src = new Path(hdfsPath);
        InputStream is = null;
        try {
            is = fs.open(src);
            ... handle Stream
        } finally {
            if (is != null) is.close();
        }
    }

You might try streaming to a C program.

On Mon, Dec 3, 2012 at 8:22 AM, Wheeler, Bill NPO <[EMAIL PROTECTED]> wrote:

> I am trying to use Hadoop’s partitioning/scheduling/storage
> infrastructure to process many HDFS files of data in parallel (1 HDFS file
> per map task), but in a way that does not naturally fit into the key/value
> pair input framework. Specifically my application’s “map” function
> equivalent does not want to receive formatted data as key/value
> pairs. Instead, I’d like to receive a Hadoop input stream object for my map
> processing so that I can read bytes out in many different ways with much
> greater flexibility and efficiency than what I’d get with the key/value
> pair input constraint. The input stream would handle the complexity of
> fetching local and remote HDFS data blocks as needed on my behalf. The
> result of the map processing would then conform to key/value pair map
> outputs and be subsequently processed by traditional reduce code.
>
> I’m guessing that I am not the only person who would like to read HDFS
> file input directly as this capability could open up a new type of Hadoop
> use models. Is there any support for acquiring input streams directly into
> java map code? And is there any support for doing the same into C++ map
> code ala Pipes?
>
> For added context, my application is in the video analytic space,
> requiring me to read video files.
> I have implemented a solution, but it
> is a hack with less than ideal characteristics: I have RecordReader code
> which simply passes the HDFS filename thru in the key field of my key/value
> input. I’m using Pipes to implement the map function in C++ code. The C++
> map code then performs a system call, “hadoop fs -copyToLocal hdfs_filename
> local_filename” to put the entire HDFS file on the datanode’s local file
> system where it is readable by C++ IO calls. I then simply open up this
> file and process it. It would be much better to avoid having to do all the
> extra IO associated with “copyToLocal” and instead somehow receive an input
> stream object from which to directly read from HDFS.
>
> Any way of doing this in a more elegant fashion?
>
> Thanks,
> Bill

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
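
For illustration only, here is a minimal sketch of what the "... handle Stream" step in the reply might look like once the mapper holds an open stream. Everything specific in it is an assumption, not something from the thread: the class name ChunkReader, the method readChunks, and the record layout (a 4-byte big-endian length followed by that many payload bytes) are hypothetical. A ByteArrayInputStream stands in for the stream returned by fs.open(src) so the sketch runs without a cluster; since the method only needs a java.io.InputStream, the HDFS stream could be passed in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads length-prefixed chunks from any InputStream.
final class ChunkReader {

    // Assumed layout: [4-byte big-endian length][payload bytes], repeated
    // until the stream ends on a chunk boundary.
    static List<byte[]> readChunks(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> chunks = new ArrayList<>();
        int first;
        // A -1 here means the stream ended cleanly between chunks.
        while ((first = data.read()) != -1) {
            // Read the remaining three bytes of the length prefix;
            // readUnsignedByte throws EOFException if the prefix is cut short.
            int length = (first << 24)
                    | (data.readUnsignedByte() << 16)
                    | (data.readUnsignedByte() << 8)
                    | data.readUnsignedByte();
            if (length < 0) {
                throw new IOException("negative chunk length: " + length);
            }
            byte[] payload = new byte[length];
            // readFully blocks until the whole payload is read and throws
            // EOFException if the stream is truncated mid-chunk.
            data.readFully(payload);
            chunks.add(payload);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory input with two chunks.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(3);
        out.write(new byte[] {'a', 'b', 'c'});
        out.writeInt(2);
        out.write(new byte[] {'h', 'i'});

        // In a mapper, this stream would come from fs.open(src) instead.
        try (InputStream is = new ByteArrayInputStream(buffer.toByteArray())) {
            for (byte[] chunk : readChunks(is)) {
                System.out.println("chunk of " + chunk.length + " bytes");
            }
        }
    }
}
```

The try-with-resources form closes the stream on every exit path, which is the same guarantee the try/finally in the reply provides.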