executing with the data but through using a file system interface
Hello Hadoop users,

I have a library that takes a string as input, finds the corresponding file
on HDFS, and performs operations on it. At the moment it doesn't take
advantage of node awareness (data locality): the code may or may not run on
the node that holds the data.  I'd like to fix this.

***Background***
A little more about the library I mentioned: I modified an imagery library
to take in a string URL and fetch the input stream corresponding to that URL
from HDFS instead of from a typical file system.  So it takes in a string
such as "hdfs://blahblahblah/image.bmp", keeps a reference to that file's
input stream, and can then do things to the image.

***Problem***
The problem is that I pass the MapReduce application a list of these image
URLs as strings, and the URLs get handed to the HDFS-ified library.  The
files they point to may or may not be stored on the node running the task,
so I don't take advantage of locality because the computation isn't with
the data.  What are my best options for taking advantage of node awareness
in this situation?  Here is what I've considered so far.

***Brainstorming solutions***
One possibility (I'm not sure about it) is to use WholeFileInputFormat as
described in O'Reilly's book.  But that takes a file and gives you the byte
array that represents the whole file, and I'm not sure I want this: some of
my files are a couple of gigabytes in size (though typically ~200MB), which
could exhaust the memory.
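
To make the memory concern concrete: a Java byte[] is indexed by int, so a single array cannot hold much more than 2 GB, and a buffered whole file counts fully against the task's heap. Below is a minimal sketch of the alternative, consuming an input stream through a small fixed-size buffer. It is plain Java, not Hadoop code; the class name StreamingChecksum and the CRC32 computation are hypothetical stand-ins for whatever the image library actually does with the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical example class: processes a stream with bounded memory.
final class StreamingChecksum {

    // Reads the stream to the end using a buffer of bufferSize bytes and
    // returns the CRC32 of everything read. Memory use is bounded by
    // bufferSize, regardless of how long the stream is.
    static long checksum(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[bufferSize];
        int n;
        // read() returns -1 at end of stream; it may return fewer bytes
        // than the buffer holds, so only the first n bytes are consumed.
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n);
        }
        return crc.getValue();
    }
}
```

Since the library already works from an input stream, the same pattern would presumably apply to a stream opened from HDFS, but that part is an assumption and is not shown here.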

Another option would be to take better advantage of the HDFS-ified library:
my task would have to interpret the string URL, determine which node that
file exists on, and then execute ON that node.  I have no clue how I'd go
about doing this.
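
For reference, the "which node" part can be sketched roughly as follows. HDFS stores a file as blocks, each replicated on several nodes, and Hadoop exposes that layout through FileSystem.getFileBlockLocations, which returns BlockLocation objects with getHosts() and getLength(). A file of ~200MB usually spans more than one block, so there may be no single node holding all of it; a reasonable choice is the host that holds the most bytes. The sketch below is plain Java with no Hadoop dependency: the Block class and LocalityPicker are hypothetical stand-ins for the information a BlockLocation carries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the host that stores the most bytes of a file.
final class LocalityPicker {

    // Stand-in for one HDFS block: its length in bytes and the hosts that
    // hold a replica of it (what a BlockLocation would report).
    static final class Block {
        final long length;
        final List<String> hosts;

        Block(long length, List<String> hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns the host holding the most bytes of the file, or null when
    // there are no blocks. Ties are broken by host name so the result is
    // deterministic.
    static String bestHost(List<Block> blocks) {
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                bytesPerHost.merge(host, block.length, Long::sum);
            }
        }
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> entry : bytesPerHost.entrySet()) {
            long bytes = entry.getValue();
            String host = entry.getKey();
            if (bytes > bestBytes
                    || (bytes == bestBytes && host.compareTo(best) < 0)) {
                best = host;
                bestBytes = bytes;
            }
        }
        return best;
    }
}
```

In a MapReduce job the chosen hosts would typically be attached to an input split (FileSplit accepts a host array) by a custom InputFormat, one split per image. The scheduler treats those hosts as a preference rather than a guarantee, so the task can still end up on another node.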

I'm sure there are other options as well, but I'm not familiar enough with
Hadoop to know them, and I was hoping someone out there might be able to
help me out.

Thanks in advance,
-Julian
Harsh J 2013-05-01, 09:48