MapReduce, mail # user - Running hadoop for processing sources in full sky maps

andrea zonca 2013-07-12, 21:43

I have a few tens of full-sky maps in binary format (FITS), about 600 MB each.

For each sky map I already have a catalog with the positions of a few
thousand sources, i.e. stars, galaxies, radio sources.

For each source I would like to:

1. open the full sky map
2. extract the relevant section, typically 20 MB or less
3. run some statistics on it
4. aggregate the outputs into a catalog

I would like to run Hadoop, possibly using Python via the streaming
interface, to process the sources in parallel.

I think the input to the mapper should be the individual catalog records;
the Python mapper can then open the full sky map, do the processing,
and print the output to stdout.
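
To make the idea concrete, here is a minimal sketch of the mapper logic I
have in mind. It assumes each input line is tab-separated (map file name,
latitude, longitude) and that records for the same map arrive next to each
other. `run_mapper`, `load_map` and `compute_stats` are hypothetical names of
my own; the last two are placeholders for reading the FITS file and for the
real statistics, not functions from any library.

```python
def run_mapper(lines, out, load_map, compute_stats):
    """Process catalog records, one per line, and write one result per source.

    load_map(map_name) and compute_stats(sky_map, lat, lon) are hypothetical
    callables supplied by the caller; they stand in for reading the FITS
    file and for the actual analysis.
    """
    current_name, current_map = None, None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        map_name, lat, lon = line.split("\t")
        # Keep only the most recent map in memory: the maps are large,
        # so reload only when the record refers to a different map.
        if map_name != current_name:
            current_name, current_map = map_name, load_map(map_name)
        value = compute_stats(current_map, float(lat), float(lon))
        # Hadoop streaming treats the text before the first tab as the key.
        out.write("%s\t%s\t%s\t%s\n" % (map_name, lat, lon, value))
```

As a streaming script, this would be called with `sys.stdin` and `sys.stdout`
plus real implementations of the two placeholders.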

Is this a reasonable approach?
If so, I need to be able to configure Hadoop so that a full sky map is
copied locally to the nodes that are processing one of its sources.
How can I achieve that?
Also, what is the best way to feed the input data to Hadoop? For each
source I have a reference to the full sky map, plus its latitude and longitude.
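
For example, the input could be a plain text file with one source per line.
A minimal sketch, assuming the catalogs are already in memory as
(map name, latitude, longitude) tuples; the function name
`catalog_to_lines` and the tab-separated layout are my own choices, not
something Hadoop requires:

```python
def catalog_to_lines(sources):
    """Return one tab-separated line per source, grouped by sky map.

    sources: iterable of (map_name, lat_deg, lon_deg) tuples.
    Sorting by map name keeps the sources of one map next to each other.
    """
    lines = []
    for map_name, lat, lon in sorted(sources):
        lines.append("%s\t%r\t%r" % (map_name, float(lat), float(lon)))
    return lines
```

Writing these lines to a text file would give one record per source, each
carrying the reference to its sky map.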

I posted this question on StackOverflow:

Andrea Zonca