Data ingress is often done as an initial MR job. Here it sounds like you'd
need a list of URLs, which you can have a single mapper run through and map
to hostnames, feeding the reducer:
hostname [url1, url2, ...]
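A minimal sketch of that keying step in plain Java (the class and method names are mine, not from any Hadoop API; a real Mapper would call this and emit (host, url) via context.write()):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlToHost {
    // Derive the shuffle key from a URL so all URLs for one host
    // end up grouped onto the same reducer.
    public static String extractHost(String url) throws MalformedURLException {
        return new URL(url).getHost().toLowerCase();
    }
}
```

With that as the map output key, the framework's shuffle does the per-host grouping for you.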
The reducer on each hostname key can do the GET operations for that host,
using whatever per-host limits you have. Remember to keep sending
heartbeats to the TaskTracker so it knows that your process is alive. Oh,
and see if you can grab any Content-Length and checksum headers to verify
the result at the end of a long download; you don't want to accidentally
pull a half-complete D/L into your work.
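One way to do the length check is to count bytes as you copy the response into HDFS and compare against the Content-Length the server advertised. A sketch, with names of my own choosing:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class VerifiedCopy {
    // Copy the stream and confirm we saw exactly the number of bytes the
    // Content-Length header promised. A negative expectedLength means the
    // header was absent, in which case we can't verify and just accept.
    public static boolean copyAndVerify(InputStream in, OutputStream out,
                                        long expectedLength) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
            // In the real reducer, also call context.progress() here so the
            // TaskTracker doesn't kill the task during a long download.
        }
        return expectedLength < 0 || total == expectedLength;
    }
}
```

If the check fails, delete the partial file and retry rather than letting downstream jobs see it.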
Once the files are in HDFS you can do more work on them, which is where
something like an Oozie workflow can be handy.
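For a rough idea of the shape, here's a two-stage Oozie workflow.xml sketch; the action names and structure are placeholders of mine, and the map-reduce bodies would need your real job configuration:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="nightly-ingest">
  <start to="download"/>
  <action name="download">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- the URL-list -> per-host HTTP fetch job described above -->
    </map-reduce>
    <ok to="process"/>
    <error to="fail"/>
  </action>
  <action name="process">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- the nightly processing over the files now in HDFS -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>ingest failed</message></kill>
  <end name="end"/>
</workflow-app>
```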
On 22 October 2012 09:40, David Parks <[EMAIL PROTECTED]> wrote:
> I want to create a MapReduce job which reads many multi-gigabyte inputs
> from various HTTP sources & processes them nightly.
> Is there a reasonably flexible way to do this in the Hadoop job itself? I
> expect the initial downloads to take many hours and I'd hope I can optimize
> the # of connections (example: I'm limited to 5 connections to one host,
> whereas another host has a 3-connection limit, so maximize as much as
> possible). Also the set of files to download will change a little over time,
> so the input list should be easily configurable (in a config file or
> Is it normal to perform batch downloads like this before running the
> mapreduce job? Or is it ok to include such steps in with the job? It seems
> handy to keep the whole process as one neat package in Hadoop if possible.