Hello Hadoop users,
I have a library that takes a string path as input, locates the
corresponding file on HDFS, and performs operations on it. At the moment
this doesn't take advantage of node awareness: the code may or may not run
on a node that actually holds the data. I'd like to fix this.
A little more about the library I mentioned: I modified an imagery
library to take in a string URL and fetch the input stream corresponding
to that URL on HDFS instead of on a local file system. So it takes the
string "hdfs://blahblahblah/image.bmp", and the library then holds a
reference to that file's input stream and can operate on the image.
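To make the setup concrete, here's roughly what the HDFS-ified part looks like. This is an illustrative sketch, not the actual library code; the class and method names are made up.

```java
// Illustrative sketch: resolving an hdfs:// URL string to an InputStream.
// "HdfsImageSource" and "open" are hypothetical names, not the real library's.
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImageSource {
    public static InputStream open(String url) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(url);                // e.g. "hdfs://namenode/image.bmp"
        FileSystem fs = path.getFileSystem(conf); // scheme in the URL selects HDFS
        return fs.open(path);                     // FSDataInputStream extends InputStream
    }
}
```

The rest of the imagery library then works against the returned InputStream as if it were reading an ordinary local file.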
The problem is that I pass the MapReduce application a list of these image
URLs, and the URLs get handed to the HDFS-ified library, but the file
behind a given URL may or may not be stored on the task node. So I lose
locality: the computation isn't running with the data. What are my best
options for taking advantage of node awareness in this situation? I've
been thinking through what my options are...
One possibility (I'm not sure about this) is to use the
WholeFileInputFormat described in O'Reilly's book, but that reads an
entire file and hands you the byte array representing it, and I'm not sure
I want that because some of my files can be a couple of gigabytes in size
(though typically ~200 MB) and could exhaust memory.
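What I imagine might work instead (untested sketch, new mapreduce API) is keeping WholeFileInputFormat's scheduling behavior, one unsplit FileSplit per file so the framework places the task near the file's blocks, but having the record reader emit only the path, so the mapper streams the file through the library instead of materializing it as a byte array:

```java
// Untested sketch: a non-splittable input format that emits just the file
// path. The FileSplit still carries the block hosts, so the scheduler can
// place the map task near the data, but no file bytes are buffered here.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PathOnlyInputFormat extends FileInputFormat<NullWritable, Text> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one map task per whole file
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            private FileSplit fileSplit;
            private boolean done = false;

            @Override public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
            }
            @Override public boolean nextKeyValue() {
                if (done) return false; // exactly one record: the file's path
                done = true;
                return true;
            }
            @Override public NullWritable getCurrentKey() {
                return NullWritable.get();
            }
            @Override public Text getCurrentValue() {
                return new Text(fileSplit.getPath().toString());
            }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}
```

The mapper would then pass the received path string straight to the HDFS-ified library, which opens its own stream; reads should mostly be local because the task was scheduled on a node holding the blocks.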
Another option would be to take better advantage of the HDFS-ified
library: have my task interpret the string URL, determine which node(s)
that file lives on, and then execute ON one of those nodes. I have no idea
how I'd go about doing this.
I'm sure there are other options as well, but I'm just not familiar enough
with Hadoop to know them, and I was hoping someone out there might be able
to help me out.
Thanks in advance,