Re: Running hadoop for processing sources in full sky maps
Sandy Ryza 2013-07-13, 18:05
For copying the full sky map to each node, look up the distributed cache.
You place the sky map file on HDFS, and Hadoop copies it to each node
that runs tasks for the job, so the tasks can read it from local disk.
For feeding the input data into Hadoop, what format is it in currently?
One simple way would be to have a text file with one source per line,
with the reference, latitude, and longitude separated by commas, and
then use TextInputFormat.
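As a rough sketch of what the Python streaming mapper could look like with
that input, assuming each line is "reference,latitude,longitude" and that
the reference is the sky map's file name. The names parse_record,
compute_stats and run_mapper are made up for illustration, and
compute_stats is only a placeholder for your own FITS processing. As far
as I know, files shipped with the job through the generic -files option
show up in the task's working directory, so the mapper can open them by
name.

```python
import sys


def parse_record(line):
    """Split one catalog line 'reference,latitude,longitude' into typed fields."""
    reference, lat, lon = line.strip().split(",")
    return reference, float(lat), float(lon)


def compute_stats(map_path, lat, lon):
    """Hypothetical placeholder: open the sky map at map_path, cut out the
    region around (lat, lon), and return the statistics for that region."""
    raise NotImplementedError("replace with the real FITS processing")


def run_mapper(lines, process=compute_stats, out=sys.stdout):
    """Write one tab-separated output record per catalog line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        reference, lat, lon = parse_record(line)
        # Assumption: the map was shipped with the job, so it can be opened
        # by its bare file name from the task's working directory.
        result = process(reference, lat, lon)
        out.write("%s\t%s\t%s\t%s\n" % (reference, lat, lon, result))


# In the actual streaming script, finish with: run_mapper(sys.stdin)
```

Streaming treats everything up to the first tab of each output line as the
key, so keying on the reference would let a reducer collect all sources of
one map when building the output catalog.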
On Fri, Jul 12, 2013 at 2:43 PM, andrea zonca <[EMAIL PROTECTED]> wrote:
> I have a few tens of full sky maps in binary format (FITS), each about 600MB.
> For each sky map I already have a catalog of the positions of a few
> thousand sources, i.e. stars, galaxies, radio sources.
> For each source I would like to:
> open the full sky map
> extract the relevant section, typically 20MB or less
> run some statistics on them
> aggregate the outputs to a catalog
> I would like to run hadoop, possibly using python via the streaming
> interface, to process them in parallel.
> I think the input to the mapper should be each record of the catalogs,
> then the python mapper can open the full sky map, do the processing
> and print the output to stdout.
> Is this a reasonable approach?
> If so, I need to be able to configure hadoop so that a full sky map is
> copied locally to the nodes that are processing one of its sources.
> How can I achieve that?
> Also, what is the best way to feed the input data to hadoop? For each
> source I have a reference to the full sky map, latitude and longitude.
> I posted this question on StackOverflow:
> Andrea Zonca