Hadoop >> mail # user >> executing with the data but through using a file system interface


Hello Hadoop users,

I have a library that takes a string as input, finds the corresponding file on
HDFS, and performs operations on it. At the moment this doesn't take advantage
of node awareness: the task may or may not run on the node that holds the
data. I'd like to fix this.

***Background***
So, a little more about the library I mentioned. I modified an imagery library
to take in a string URL and fetch the input stream corresponding to that URL
on HDFS instead of on a typical file system. So it takes the string
"hdfs://blahblahblah/image.bmp", and the library then holds a reference to
that file's input stream and can do things to the image.
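For what it's worth, the usual way to do this kind of resolution with Hadoop's
Java API is FileSystem.get(...).open(...). Here's a minimal sketch of the URL
handling; the namenode host and image path are made up, and the cluster-only
Hadoop calls are left as comments since they need a live HDFS to run:

```java
import java.net.URI;

// Sketch of how an "HDFS-ified" open might resolve a string URL.
// The URL used below is hypothetical.
public class HdfsUrlSketch {

    // Split an hdfs:// URL into the pieces the FileSystem API needs.
    public static String[] parse(String url) {
        URI uri = URI.create(url);
        // [scheme, namenode authority, absolute path inside HDFS]
        return new String[] { uri.getScheme(), uri.getAuthority(), uri.getPath() };
    }

    public static void main(String[] args) {
        String[] parts = parse("hdfs://namenode:8020/images/image.bmp");
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]);

        // On a cluster, the library would then open the stream like this:
        // Configuration conf = new Configuration();
        // FileSystem fs = FileSystem.get(URI.create(url), conf);
        // FSDataInputStream in = fs.open(new Path(url));
        // ... hand `in` to the imagery code ...
    }
}
```

FileSystem.get uses the URL's authority (the namenode address) to pick the
right cluster, which is why keeping the full hdfs:// URL around is handy.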

***Problem***
The problem is that I pass the MapReduce application a list of these image
URLs, which get handed to the HDFS-ified library, but the files behind those
URLs may or may not live on the task's node, so I don't take advantage of
locality: the computation isn't running where the data is. What are my best
options for taking advantage of node awareness in this situation? I've been
thinking through what my options are...

***Brainstorming solutions***
One possibility (I'm not sure about it) is to use the WholeFileInputFormat
described in O'Reilly's book, but that takes a file and gives you the byte
array representing it, and I'm not sure I want this because some of my files
can be a couple of gigabytes in size (though typically ~200 MB) and could
exhaust memory.
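One thing worth noting: if the mapper receives the file's *path* (rather than
its bytes), the memory concern goes away, because the file can be streamed in
fixed-size chunks instead of materialized as one byte[]. A sketch of the
bounded-memory read loop, using a stand-in in-memory stream in place of the
FSDataInputStream an HDFS open would return:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: process a (potentially multi-GB) stream in fixed-size chunks
// so memory stays bounded, instead of materializing the whole file as
// one byte[] the way WholeFileInputFormat does.
public class ChunkedReader {

    // Returns the total number of bytes consumed; the 64 KB buffer is
    // the only sizable allocation, regardless of file size.
    public static long consume(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            // ... feed buf[0..n) to the image-processing code here ...
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for an HDFS input stream
        InputStream fake = new ByteArrayInputStream(new byte[200_000]);
        System.out.println(consume(fake)); // prints 200000
    }
}
```

This only works if the imagery code can operate on a stream rather than
requiring the whole image in memory at once, which depends on the library.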

Another option would be to take better advantage of the HDFS-ified library: I'd
have my task interpret the string URL, determine which node that file lives
on, and then execute ON that node. I have no clue how I'd go about doing this.
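On the "which node has the file" part: HDFS exposes this through
FileSystem.getFileBlockLocations, which returns a BlockLocation per block,
and BlockLocation.getHosts() lists the datanodes holding that block. You
generally can't force user code to "execute on that node" directly; instead,
an InputFormat reports those hosts via InputSplit.getLocations() and the
MapReduce scheduler tries to place the task there. A sketch of the host
selection given per-block host lists; the node names are made up and the
cluster-only call is left as a comment:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: given the hosts reported for each block of a file, pick the
// datanode holding the most blocks. On a real cluster the per-block
// host lists would come from:
//   FileStatus st = fs.getFileStatus(path);
//   BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
//   // blocks[i].getHosts() -> String[] of datanodes holding block i
public class LocalityPicker {

    public static String bestHost(List<String[]> hostsPerBlock) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] hosts : hostsPerBlock) {
            for (String h : hosts) {
                counts.merge(h, 1, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > counts.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // made-up 3-block file with replication factor 2
        List<String[]> blocks = List.of(
                new String[] { "node1", "node2" },
                new String[] { "node2", "node3" },
                new String[] { "node2", "node1" });
        System.out.println(bestHost(blocks)); // prints node2
    }
}
```

FileInputFormat.getSplits() does essentially this internally, which is why an
input format that gives each mapper one file path, with the split reporting
that file's hosts, is the usual way to get locality for whole-file jobs.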

I'm sure there are other options as well, but I'm just not familiar enough
with Hadoop to know them, and I was hoping someone out there might be able to
help me out.

Thanks in advance,
-Julian