MapReduce, mail # user - Mapreduce jobs to download job input from across the internet


David Parks 2013-04-17, 04:26
Peyman Mohajerian 2013-04-17, 16:41
Re: Mapreduce jobs to download job input from across the internet
Marcos Luis Ortiz Valmase... 2013-04-17, 19:59
You can find it here:
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
2013/4/17 Peyman Mohajerian <[EMAIL PROTECTED]>

> Apache Flume may help you with this use case. I read an article on
> Cloudera's site about using Flume to pull tweets, and the same idea may
> apply here.
>
>
> On Tue, Apr 16, 2013 at 9:26 PM, David Parks <[EMAIL PROTECTED]> wrote:
>
>> For a set of jobs to run I need to download about 100GB of data from the
>> internet (~1000 files of varying sizes from ~10 different domains).
>>
>> Currently I do this in a simple linux script as it’s easy to script FTP,
>> curl, and the like. But it’s a mess to maintain a separate server for that
>> process. I’d rather it run in mapreduce. Just give it a bill of materials
>> and let it go about downloading it, retrying as necessary to deal with iffy
>> network conditions.
>>
>> I wrote one such job to crawl images we need to acquire, and it was the
>> royalest of royal pains. I wonder if there are any good approaches to this
>> kind of data acquisition task in Hadoop. It would certainly be nicer just
>> to schedule a data-acquisition job ahead of the processing jobs in Oozie
>> rather than try to maintain synchronization between the download processes
>> and the jobs.
>>
>> Ideas?
>>
>
>
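For reference, Peyman's Flume suggestion would look roughly like the agent definition below. This is only a sketch: `com.example.DownloadSource` is a hypothetical class name (as with the custom TwitterSource in Cloudera's blog post, a source that pulls files over FTP/HTTP would have to be written), while the file channel and HDFS sink are standard Flume components.

```properties
# Sketch of a Flume agent that buffers downloaded files into HDFS.
# NOTE: com.example.DownloadSource is hypothetical -- a custom Flume
# source must be implemented to fetch files over FTP/HTTP.
agent.sources = dl
agent.channels = ch
agent.sinks = hdfs-out

agent.sources.dl.type = com.example.DownloadSource
agent.sources.dl.channels = ch

agent.channels.ch.type = file

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.channel = ch
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/data/incoming
```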
--
Marcos Ortiz Valmaseda,
*Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn: *http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186 <http://twitter.com/marcosluis2186>
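David's bill-of-materials idea can also be done directly in MapReduce with Hadoop Streaming: supply the URL list as the job's input (NLineInputFormat hands each mapper a slice of the lines) and have a mapper download each URL with retries. The sketch below is illustrative and untested against a real cluster; it only emits status lines rather than actually writing the files to HDFS.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads one URL per input line,
# downloads it with retries, and emits a tab-separated status record.
import sys
import time
import urllib.request


def fetch_with_retry(url, attempts=3, backoff=2.0, opener=urllib.request.urlopen):
    """Download `url`, retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return opener(url).read()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff ** attempt)


def main():
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        try:
            data = fetch_with_retry(url)
            # A real job would write `data` into HDFS here
            # (e.g. via an HDFS client library) instead of discarding it.
            print("%s\tOK\t%d" % (url, len(data)))
        except Exception as e:
            print("%s\tFAILED\t%s" % (url, e))


if __name__ == "__main__":
    main()
```

A run might look like the following (verify the flag and property names against your Hadoop version): `hadoop jar hadoop-streaming.jar -D mapreduce.input.lineinputformat.linespermap=10 -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat -input urls.txt -output dl-status -mapper mapper.py -file mapper.py`. Failed URLs show up in the job output, so a follow-up pass can re-feed just the `FAILED` lines.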