1. MR uses the host hints provided by an InputSplit object's
getLocations return. 
2. Typically, (1) is populated by taking each block's locations,
available via the FileSystem's FileStatus objects queried over a block
of a file .
3. You need something custom like a WholeFileInputFormat, whose
getSplits returns one complete FileSplit per file to process by the
job, and the file split object created includes the ordered location
hints gotten and applied per (2) and (1).
5. Your input format's record reader need not read them into a byte
as the WholeFileInputFormat does, but you may instead have it return a
more efficient or dummy record reader that better suits your map
processing logic/needs. This is handled via what you choose to return
4. Additionally, you may use a large block size factor for your files
loaded to HDFS, if your file format itself isn't splittable. This is
to avoid the micro-overhead of managing several locations of several
blocks under the same file when designing the location returns for
 - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getLocations()
 - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path,%20long,%20long)
 - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html#createRecordReader(org.apache.hadoop.mapreduce.InputSplit,%20org.apache.hadoop.mapreduce.TaskAttemptContext)
Does this help?
On Wed, May 1, 2013 at 2:55 PM, Julian Bui <[EMAIL PROTECTED]> wrote:
> Hello hadoop users,
> I have a library that takes a string as input and finds the file on the HDFS
> and performs operations on it...but at the moment this doesn't take
> advantage of node awareness; it may or may not run on the node with the
> data. I'd like to fix this.
> So a little more about the library I mentioned. I modified an imagery
> library to take in a string URL and fetches the input stream corresponding
> to that URL on the HDFS instead of a typical file system. So it'd take in
> the string "hdfs://blahblahblah/image.bmp" and now the library maintains a
> reference to this file's input stream and can do things to this image.
> The problem is that I pass to the MapReduce application a string list of
> these images and these URLs get passed to the HDFS-ified library but these
> URLs may or may not be on the task node and so I don't take advantage of
> locality because computation isn't with the data. What are my best options
> for taking advantage of node awareness in this situation? I was thinking
> what my options are...
> ***Brainstorming solutions***
> One possible one (not sure???) is to use WholeFileInputFormat as described
> in OReilly's book but this really takes a file and gives you the byte array
> that represents the file and I'm not sure I want this because some of my
> files can be a couple of gigabytes in size (though typically ~200MB) and can
> exhaust the memory.
> Another option would be to take better advantage of the HDFS-ified library
> so I'd have to get my task to interpret the string URL and determine which
> node that file exists on and then execute ON that node. I have no clue how
> I'd go about doing this.
> I'm sure there are other options as well but I'm just not that familiar with
> hadoop to know and I was hoping someone out there might be able to help me
> Thanks in advance,