MapReduce >> mail # user >> Mapreduce jobs to download job input from across the internet


MapReduce jobs to download job input from across the internet
For a set of jobs to run I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).

 

Currently I do this in a simple Linux script, as it's easy to script FTP,
curl, and the like. But it's a mess to maintain a separate server for that
process. I'd rather it run in MapReduce. Just give it a bill of materials
and let it go about downloading it, retrying as necessary to deal with iffy
network conditions.
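
To make the "bill of materials plus retries" idea concrete, here is a rough
sketch of the per-file download-with-retry step in plain Java. It is only an
illustration under assumptions: the class RetryingFetcher, the Source and Sink
hooks, and the backoff parameters are all made up for this example. In a real
job this would presumably be called from a mapper's map() for each line of the
bill of materials, with the Sink backed by an HDFS output stream opened in
overwrite mode.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical helper: copies one remote file to a destination, retrying on I/O errors.
final class RetryingFetcher {

    /** Opens the remote side of one bill-of-materials entry (hypothetical hook). */
    interface Source {
        InputStream open(String location) throws IOException;
    }

    /** Opens a fresh destination stream; called once per attempt so a failed
     *  partial copy is overwritten rather than appended to (hypothetical hook). */
    interface Sink {
        OutputStream create() throws IOException;
    }

    /** Default source: any scheme java.net.URL has a handler for (http, https, ftp, file). */
    static final Source URL_SOURCE = location -> URI.create(location).toURL().openStream();

    /**
     * Copies location to the sink, retrying up to maxAttempts times and doubling
     * the sleep between attempts. Returns the number of bytes copied.
     */
    static long fetch(Source source, Sink sink, String location,
                      int maxAttempts, long initialBackoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (InputStream in = source.open(location);
                 OutputStream out = sink.create()) {
                byte[] buf = new byte[8192];
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    total += n;
                }
                return total;
            } catch (IOException e) {
                last = e; // remember the failure and try again after a pause
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw new IOException(
            "giving up on " + location + " after " + maxAttempts + " attempts", last);
    }
}
```

If this ran inside Hadoop, something like NLineInputFormat could hand each
mapper a few lines of the bill of materials, and the copy loop would probably
need to report progress now and then so long downloads are not killed as hung
tasks. I have not tried either here, so treat both as guesses.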

 

I wrote one such job to crawl images we need to acquire, and it was the
royalest of royal pains. I wonder if there are any good approaches to this
kind of data acquisition task in Hadoop. It would certainly be nicer just to
schedule a data-acquisition job ahead of the processing jobs in Oozie rather
than try to maintain synchronization between the download processes and the
jobs.

 

Ideas?

 

Peyman Mohajerian 2013-04-17, 16:41
Marcos Luis Ortiz Valmase... 2013-04-17, 19:59