Hi Hadoop users,
I'm trying to find out if computation migration is something the developer
needs to worry about or if it's supposed to be hidden.
I would like to use Hadoop to take in a list of image paths in HDFS and
then have each task compress these large, raw images into something much
smaller, say JPEG files.
Input: list of paths
Output: compressed JPEG files
Since I don't really need a reduce task (I'm using Hadoop mainly for its
reliability and orchestration), my mapper ought to just take the list of
image paths and work on them. As I understand it, the blocks of each image
will likely be spread across multiple data nodes.
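To make the per-image work concrete, here is a rough sketch of what I
imagine each map call doing once it has the pixels in memory. It uses only
the JDK's javax.imageio and no Hadoop classes. The class name RawToJpeg,
the method toJpeg, and the packed-RGB input layout are my own assumptions
for illustration, not anything Hadoop provides.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: the name and the input layout are assumptions.
final class RawToJpeg {

    // Encodes row-major packed 0xRRGGBB pixels as a JPEG and returns the bytes.
    static byte[] toJpeg(int[] rgb, int width, int height) {
        if (rgb.length != width * height) {
            throw new IllegalArgumentException("pixel count does not match dimensions");
        }
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, width, height, rgb, 0, width);

        // Keep the encoder's intermediate data in memory rather than a temp file.
        ImageIO.setUseCache(false);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false when no writer handles the format.
            if (!ImageIO.write(img, "jpg", out)) {
                throw new IllegalStateException("no JPEG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In the real job I assume the pixels would be read from the HDFS path given
to the mapper and the resulting bytes written back to HDFS, but that part
is not shown here.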
My question is how will each mapper task "migrate the computation" to the
data nodes? I recall reading that the namenode is supposed to deal with
this. Is it hidden from the developer? Or as the developer, do I need to
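discover where the data lies and then migrate the task to that node? Since
my input is just a list of paths rather than the images themselves, it
seems the framework would only know where the list is stored, not where
the images are, so it couldn't really do this for me.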
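In case it helps show what I mean by doing this by hand, here is a rough
sketch of the "discover where the data lies" step as I picture it: given
the hosts holding each block of one image, rank the hosts by how many
blocks they hold. The host lists are hard-coded here. In a real job I
assume they would come from something like
FileSystem.getFileBlockLocations() and BlockLocation.getHosts(). The class
name PreferredHosts and the method rank are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Hadoop, just an illustration.
final class PreferredHosts {

    // Each array lists the hosts storing a replica of one block of the file.
    // Returns hosts ordered by how many blocks they hold, most first.
    // Ties keep the order in which the hosts were first seen.
    static List<String> rank(List<String[]> blockHosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] hosts : blockHosts) {
            for (String host : hosts) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so equal counts stay in first-seen order.
        ranked.sort(Comparator.comparing((String h) -> counts.get(h)).reversed());
        return ranked;
    }
}
```

If I have to write and maintain something like this myself for every
image, I'd like to know; if the framework already does the equivalent, I'd
rather not duplicate it.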
Another question: Where can I find out more about this? I've looked up
"rack awareness" and "computation migration" but haven't really found much
code relating to either one - leading me to believe I'm not supposed to
have to write code to deal with this.
Anyway, could someone please help me out or set me straight on this?