Hadoop, mail # user - Direct HDFS access from a streaming job

Keith Wiley 2011-03-24, 05:26
How do I process files, one per map?

As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:

• Hadoop Streaming and custom mapper script:
    • Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
    • Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory.

I'm not trying to gzip files as in the example, but I would like to read files directly from HDFS into C++ streaming code, as opposed to passing those files as input through the streaming input interface (stdin).

I'm not sure how to reference HDFS from C++, though. For instance, how would one open an ifstream on such a file?
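To make the question concrete: `std::ifstream` only opens paths on the local filesystem, so it cannot take an HDFS path directly. One possible workaround, sketched below, is to pipe the output of `hadoop fs -cat` into the program with POSIX `popen` and wrap the bytes in a `std::istringstream`, which can then be read much like an `ifstream`. The names `shellQuote`, `hdfsCatCommand`, `readCommandOutput` and `openHdfsFile` are hypothetical, and the sketch assumes a POSIX system with the `hadoop` command on the PATH. Hadoop also ships a C library, libhdfs (`hdfsConnect`, `hdfsOpenFile`, `hdfsRead`), which is another route but is not shown here.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes so the shell treats it as one argument.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Build the command that streams one HDFS file to stdout.
std::string hdfsCatCommand(const std::string& hdfsPath) {
    assert(!hdfsPath.empty());
    return "hadoop fs -cat " + shellQuote(hdfsPath);
}

// Run a command through the shell and return everything it writes to stdout.
// Throws if the pipe cannot be opened or the command exits with a nonzero status.
std::string readCommandOutput(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed: " + command);
    std::string data;
    char buffer[4096];
    std::size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        data.append(buffer, n);  // binary-safe: appends exactly n bytes
    }
    if (pclose(pipe) != 0) throw std::runtime_error("command failed: " + command);
    return data;
}

// Return an input stream over the contents of an HDFS file.
std::istringstream openHdfsFile(const std::string& hdfsPath) {
    return std::istringstream(readCommandOutput(hdfsCatCommand(hdfsPath)));
}
```

Note that this sketch holds the whole file in memory, so it only suits files that fit comfortably in RAM; for larger files one would read from the pipe incrementally instead.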

Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda