MapReduce >> mail # user >> Mapreduce jobs to download job input from across the internet


David Parks 2013-04-17, 04:26
Peyman Mohajerian 2013-04-17, 16:41
Re: Mapreduce jobs to download job input from across the internet
You can find it here:
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
2013/4/17 Peyman Mohajerian <[EMAIL PROTECTED]>

> Apache Flume may help with this use case. I read an article on
> Cloudera's site about using Flume to pull tweets, and the same idea may
> apply here.
>
>
> On Tue, Apr 16, 2013 at 9:26 PM, David Parks <[EMAIL PROTECTED]> wrote:
>
>> For a set of jobs to run, I need to download about 100GB of data from the
>> internet (~1000 files of varying sizes from ~10 different domains).
>>
>> Currently I do this in a simple Linux script, as it’s easy to script FTP,
>> curl, and the like. But it’s a mess to maintain a separate server for that
>> process. I’d rather it run in MapReduce. Just give it a bill of materials
>> and let it go about downloading it, retrying as necessary to deal with iffy
>> network conditions.
>>
>> I wrote one such job to crawl images we need to acquire, and it was the
>> royalest of royal pains. I wonder if there are any good approaches to this
>> kind of data acquisition task in Hadoop. It would certainly be nicer just
>> to schedule a data-acquisition job ahead of the processing jobs in Oozie
>> rather than try to maintain synchronization between the download processes
>> and the jobs.
>>
>> Ideas?
>>
>
>
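For anyone who wants to try the "bill of materials" idea from the quoted question directly, here is a minimal sketch of a download mapper that could be run under Hadoop Streaming. It is only an illustration, not something taken from the thread: it assumes the job input is a text file with one URL per line, and the names fetch_with_retry, http_fetch, and run_mapper are made up for this example. Writing the downloaded bytes to HDFS is deliberately left out; the sketch only reports the outcome per URL.

```python
import sys
import time
import urllib.request


def fetch_with_retry(url, fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on OSError.

    urllib's URLError/HTTPError and socket timeouts are all OSError
    subclasses, so one except clause covers flaky network conditions.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def http_fetch(url, timeout=60):
    """Download one URL into memory (assumes files small enough for that)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def run_mapper(lines, fetch=http_fetch, out=sys.stdout, **retry_opts):
    """Treat each input line as one URL and emit url<TAB>status<TAB>detail."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the bill of materials
        try:
            data = fetch_with_retry(url, fetch, **retry_opts)
            out.write(f"{url}\tOK\t{len(data)}\n")
        except OSError as exc:
            out.write(f"{url}\tFAILED\t{exc}\n")
```

As a streaming mapper script, the entry point would just call run_mapper(sys.stdin). Emitting a FAILED record instead of crashing means one bad URL does not fail the whole task, and the failed URLs can be collected from the job output and fed back in as the next bill of materials.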
--
Marcos Ortiz Valmaseda,
*Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn*: http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186 <http://twitter.com/marcosluis2186>