Large input files via HTTP
David Parks 2012-10-22, 08:40
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.
Is there a reasonably flexible way to do this in the Hadoop job itself? I
expect the initial downloads to take many hours and I'd hope I can optimize
the # of connections (example: I'm limited to 5 connections to one host,
whereas another host has a 3-connection limit, so maximize as much as
possible). Also the set of files to download will change a little over time
so the input list should be easily configurable (in a config file or
Is it normal to perform batch downloads like this before running the
mapreduce job? Or is it ok to include such steps in with the job? It seems
handy to keep the whole process as one neat package in Hadoop if possible.
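
A minimal sketch of the per-host connection limit described above, written as
a standalone pre-download step rather than as part of the Hadoop job. The
function name download_all, the fetch callback, the default limit of 2, and
the example.com host names are all hypothetical and only illustrate the idea;
the limits map stands in for whatever config file holds the input list.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def download_all(urls, fetch, host_limits, default_limit=2):
    """Call fetch(url) for every URL, capping concurrent calls per host.

    host_limits maps a host name to its maximum number of simultaneous
    connections (hypothetical config); hosts that are not listed fall
    back to default_limit, which is an assumed value.
    """
    pools = {}    # one thread pool per host, sized to that host's limit
    futures = []
    try:
        for url in urls:
            host = urlparse(url).hostname
            if host not in pools:
                limit = host_limits.get(host, default_limit)
                pools[host] = ThreadPoolExecutor(max_workers=limit)
            futures.append(pools[host].submit(fetch, url))
        # Collect results in the same order as the input list.
        return [f.result() for f in futures]
    finally:
        for pool in pools.values():
            pool.shutdown(wait=True)
```

Because each host gets its own pool, a slow host with a 3-connection limit
does not hold back a host that allows 5. The fetch callback is left open
here: it could, for example, save each file locally before the files are
copied into HDFS and the MapReduce job is started.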