MapReduce, mail # user - Mapreduce jobs to download job input from across the internet


Re: Mapreduce jobs to download job input from across the internet
Peyman Mohajerian 2013-04-17, 16:41
Apache Flume may help you with this use case. I read an article on
Cloudera's site about using Flume to pull tweets, and the same idea may apply
here.
On Tue, Apr 16, 2013 at 9:26 PM, David Parks <[EMAIL PROTECTED]> wrote:

> For a set of jobs to run I need to download about 100GB of data from the
> internet (~1000 files of varying sizes from ~10 different domains).
>
> Currently I do this in a simple Linux script, as it’s easy to script FTP,
> curl, and the like. But it’s a mess to maintain a separate server for that
> process. I’d rather it run in MapReduce: just give it a bill of materials
> and let it go about downloading, retrying as necessary to deal with iffy
> network conditions.
>
> I wrote one such job to crawl images we need to acquire, and it was the
> royalest of royal pains. I wonder if there are any good approaches to this
> kind of data acquisition task in Hadoop. It would certainly be nicer just
> to schedule a data-acquisition job ahead of the processing jobs in Oozie
> rather than try to maintain synchronization between the download processes
> and the jobs.
>
> Ideas?
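
For illustration only, the "bill of materials plus retries" idea from the quoted message might look roughly like the sketch below. It is not code from anyone in this thread. It assumes the bill of materials is a plain list of URLs, one per line, and the names fetch_url, download_with_retry, and process_manifest are hypothetical. Retries use a simple exponential backoff, and only the Python standard library is used.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Download one URL and return its bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_with_retry(url, fetch=fetch_url, max_attempts=5,
                        base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses.
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


def process_manifest(lines, **kwargs):
    """Yield (url, status, byte_count) for each URL in the bill of materials.

    Blank lines and lines starting with '#' are skipped. A URL that still
    fails after all retries is reported as 'failed' rather than aborting
    the remaining downloads.
    """
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        try:
            data = download_with_retry(url, **kwargs)
            yield url, "ok", len(data)
        except OSError:
            yield url, "failed", 0
```

A function like process_manifest could, in principle, be fed lines from standard input so that each map task handles its own slice of the URL list. Where the downloaded bytes should be written (HDFS or elsewhere) is left out here, since the thread does not say.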