Large input files via HTTP
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.
  
Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours, and I'd like to
optimize the number of connections (for example: I'm limited to 5
connections to one host, whereas another host has a 3-connection limit, so
I want to use as many connections as each host allows).  Also, the set of
files to download will change a little over time, so the input list should
be easily configurable (in a config file or equivalent).
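
For concreteness, here is a rough sketch of the kind of standalone
pre-download step I have in mind: a small driver that reads a
newline-separated URL list, fetches each file over HTTP while capping the
number of connections per host, and copies the bytes into HDFS. The file
name urls.txt, the host names, the limits, and the HDFS paths below are
just placeholders, not anything I've settled on.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpPreDownload {

  // Hypothetical per-host connection limits; unknown hosts fall back to 2.
  private static final Map<String, Integer> HOST_LIMITS = new HashMap<>();
  static {
    HOST_LIMITS.put("five-connection-host.example.com", 5);
    HOST_LIMITS.put("three-connection-host.example.com", 3);
  }
  private static final Map<String, Semaphore> HOST_SLOTS = new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path targetDir = new Path("/data/incoming");   // placeholder HDFS directory

    // urls.txt is a placeholder: one URL per line, edited as the source set changes.
    List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
    ExecutorService pool = Executors.newFixedThreadPool(10);
    for (String spec : urls) {
      pool.submit(() -> download(spec.trim(), fs, targetDir));
    }
    pool.shutdown();
    pool.awaitTermination(12, TimeUnit.HOURS);
  }

  private static void download(String spec, FileSystem fs, Path targetDir) {
    try {
      URL url = new URL(spec);
      // One semaphore per host enforces that host's connection limit.
      Semaphore slots = HOST_SLOTS.computeIfAbsent(
          url.getHost(), h -> new Semaphore(HOST_LIMITS.getOrDefault(h, 2)));
      slots.acquire();
      try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        Path target = new Path(targetDir, new Path(url.getPath()).getName());
        try (InputStream in = conn.getInputStream()) {
          // copyBytes closes the HDFS output stream because the last argument is true.
          IOUtils.copyBytes(in, fs.create(target), 4096, true);
        }
      } finally {
        slots.release();
      }
    } catch (Exception e) {
      System.err.println("Failed to fetch " + spec + ": " + e);
    }
  }
}

The per-host Semaphore is how I'd imagine respecting the different
connection limits, and swapping out urls.txt is what I mean by keeping the
input list configurable.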
  
 - Is it normal to perform batch downloads like this *before* running the
MapReduce job?
 - Or is it ok to include such steps as part of the job?
 - It seems handy to keep the whole process as one neat package in Hadoop if
possible.  
 - What class should I implement if I wanted to manage this myself? Would I
just extend TextInputFormat, for example, and do the HTTP processing there?
Or should I implement a FileSystem? (A rough sketch of what I have in mind
follows this list.)
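
To make the in-job option concrete, here is a rough sketch of what I
imagine: feed the job a small text file of URLs, reuse NLineInputFormat so
each map task gets exactly one URL, and have the mapper stream that URL
straight into HDFS with no reduce phase. That sidesteps writing a custom
InputFormat entirely; the paths below are placeholders and I may well be
holding the API wrong.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HttpFetchJob {

  // Each map task receives one line of the URL list and streams that file into HDFS.
  public static class FetchMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text urlLine, Context context)
        throws IOException, InterruptedException {
      URL url = new URL(urlLine.toString().trim());
      FileSystem fs = FileSystem.get(context.getConfiguration());
      // Placeholder target directory; the file keeps its original name.
      Path target = new Path("/data/incoming", new Path(url.getPath()).getName());
      try (InputStream in = url.openStream()) {
        IOUtils.copyBytes(in, fs.create(target), 4096, true);
      }
      context.getCounter("fetch", "files").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                        // map-only job
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);    // one URL per map task
    FileInputFormat.addInputPath(job, new Path("/data/url-list.txt"));  // placeholder
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One thing I notice with this version: with one URL per split, the number of
simultaneous connections to a given host is bounded only by how many map
tasks run at once, so the per-host limits would still need handling
somewhere.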

Thanks,
David