MapReduce >> mail # user >> Mapreduce jobs to download job input from across the internet


Mapreduce jobs to download job input from across the internet
For a set of jobs to run I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).

Currently I do this in a simple Linux script, since it's easy to script FTP,
curl, and the like. But maintaining a separate server for that process is a
mess. I'd rather run it in MapReduce: just give it a bill of materials and let
it go about downloading, retrying as necessary to deal with iffy network
conditions.
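For concreteness, the single-server approach can be sketched in Python rather than shell. The file name `bill_of_materials.txt`, the retry budget, and the backoff are illustrative assumptions, not details from the actual script:

```python
import time
import urllib.request

def download_with_retries(url, dest, retries=3, backoff=5.0,
                          fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Fetch one URL to a local path, retrying to ride out iffy network conditions."""
    for attempt in range(retries):
        try:
            fetch(url, dest)
            return True
        except OSError:  # URLError subclasses OSError, so this catches network failures
            if attempt < retries - 1:
                sleep(backoff * (attempt + 1))  # linear backoff (illustrative choice)
    return False

def process_bill_of_materials(path="bill_of_materials.txt"):
    """Read one URL per line and download each file, collecting failures for review."""
    failures = []
    with open(path) as bom:
        for line in bom:
            url = line.strip()
            if not url or url.startswith("#"):
                continue
            dest = url.rsplit("/", 1)[-1]  # save under the file's own name
            if not download_with_retries(url, dest):
                failures.append(url)
    return failures
```

Injecting `fetch` and `sleep` keeps the retry logic testable without touching the network.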

I wrote one such job to crawl images we need to acquire, and it was the
royalest of royal pains. I wonder if there are any good approaches to this
kind of data-acquisition task in Hadoop. It would certainly be nicer to
schedule a data-acquisition job ahead of the processing jobs in Oozie than to
try to maintain synchronization between the download processes and the jobs.
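One way to get the download step into the cluster without writing a full Java job is Hadoop Streaming: put one URL per line into the bill of materials as the job input, and let each mapper download its share with retries before emitting a status record. Below is a minimal mapper sketch; the `url<TAB>hdfs-dest` input format, the retry count, and the `curl` plus `hadoop fs -put` pipeline are assumptions for illustration, not an established convention:

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper sketch: each input line is "<url>" or "<url>\t<hdfs-dest>".
import subprocess
import sys

RETRIES = 3  # illustrative per-URL retry budget

def parse_line(line):
    """Split a mapper input line into (url, hdfs_dest); returns None for blank lines."""
    line = line.strip()
    if not line:
        return None
    url, _, dest = line.partition("\t")
    return url, dest or url.rsplit("/", 1)[-1]  # default dest: the file's own name

def fetch(url, dest, run=subprocess.call):
    """Download one URL to local disk, then push it to HDFS; True on success."""
    for _ in range(RETRIES):
        if run(["curl", "-sf", "-o", "/tmp/download.part", url]) == 0:
            return run(["hadoop", "fs", "-put", "-f", "/tmp/download.part", dest]) == 0
    return False

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Process this mapper's input split, emitting one "<url>\t<OK|FAIL>" record per URL."""
    for line in stdin:
        parsed = parse_line(line)
        if parsed is None:
            continue
        url, dest = parsed
        stdout.write(f"{url}\t{'OK' if fetch(url, dest) else 'FAIL'}\n")

# When run under Hadoop Streaming, invoke main() to consume the mapper's split.
```

The FAIL records land in the job output, so they can be fed back in as the next run's bill of materials; MapReduce's own task re-execution then adds a second layer of retries on top of the in-mapper ones.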

Ideas?

Peyman Mohajerian 2013-04-17, 16:41
Marcos Luis Ortiz Valmase... 2013-04-17, 19:59