-
Large input files via HTTP
David Parks 2012-10-22, 08:40
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources & processes them nightly.
Is there a reasonably flexible way to do this in the Hadoop job itself? I expect the initial downloads to take many hours, and I'd hope to optimize the number of connections (example: I'm limited to 5 connections to one host, whereas another host has a 3-connection limit, so I want to use as many as each host allows). Also, the set of files to download will change a little over time, so the input list should be easily configurable (in a config file or equivalent).
Is it normal to perform batch downloads like this before running the MapReduce job? Or is it OK to include such steps in the job? It seems handy to keep the whole process as one neat package in Hadoop if possible.
-
Re: Large input files via HTTP
Steve Loughran 2012-10-22, 08:47
Data ingress is often done as an initial MR job.
Here it sounds like you'd need a list of URLs, which a single mapper can run through and map to (hostname, url) pairs,
which feed the reducer as:
hostname [url1, url2, ...]
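A minimal sketch of that map-and-group step, in plain Python rather than as a real Hadoop mapper. The function names and the example URLs are made up for illustration; in an actual job the grouping would be done by the shuffle, not by a dictionary.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_url(url):
    """Map step: emit a (hostname, url) pair for one input URL."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    return host, url


def group_by_host(urls):
    """Stand-in for the shuffle: collect every URL under its hostname key."""
    grouped = defaultdict(list)
    for url in urls:
        host, value = map_url(url)
        grouped[host].append(value)
    return dict(grouped)


# Hypothetical input list, for illustration only.
urls = [
    "http://a.example.com/data/part1.gz",
    "http://b.example.com/dump.xml",
    "http://a.example.com/data/part2.gz",
]
grouped = group_by_host(urls)
```

Each key in `grouped` corresponds to one reducer call, which is what lets the per-host connection limit be enforced in one place.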
The reducer for each hostname key can do the GET operations for that host, using whatever per-host limits you have. Remember to keep sending heartbeats to the TaskTracker so it knows that your process is alive. Oh, and see if you can grab any Content-Length and checksum headers to verify at the end of a long download; you don't want to accidentally pull a half-complete download into your work.
Once the files are in HDFS you can do more work on them, which is where something like an Oozie workflow can be handy.
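A rough sketch of the per-host reducer idea, again in plain Python and not tied to any Hadoop API. It assumes the host's connection limit is passed in as `limit` and that the HTTP GET is supplied as a `fetch` callable returning `(body, headers)`; both names, and the `fake_fetch` stand-in, are invented here. The checksum check assumes the server sends a base64 `Content-MD5` header, which many servers do not. Heartbeats and writing to HDFS are left out.

```python
import base64
import hashlib
from concurrent.futures import ThreadPoolExecutor


def verify(body, headers):
    """Check a finished download against Content-Length and Content-MD5, if sent."""
    length = headers.get("Content-Length")
    if length is not None and int(length) != len(body):
        return False  # short (or long) read: treat as a failed download
    md5 = headers.get("Content-MD5")
    if md5 is not None:
        actual = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
        if actual != md5:
            return False
    return True


def reduce_host(host, urls, limit, fetch):
    """Reduce step for one hostname: at most `limit` downloads in flight."""
    def fetch_one(url):
        body, headers = fetch(url)
        return url, verify(body, headers)

    # The pool size is what enforces the per-host connection limit.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return dict(pool.map(fetch_one, urls))


def fake_fetch(url):
    """Stand-in for a real GET; a URL ending in 'truncated' loses its last bytes."""
    full = url.encode()
    body = full[:-3] if url.endswith("truncated") else full
    return body, {"Content-Length": str(len(full))}


results = reduce_host(
    "a.example.com",
    ["http://a.example.com/ok", "http://a.example.com/truncated"],
    limit=5,
    fetch=fake_fetch,
)
```

URLs that map to `False` in `results` are the half-complete downloads that should be retried rather than passed on to later processing.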
-
Large input files via HTTP
David Parks 2012-10-22, 08:54
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources & processes them nightly.
Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours, and I'd hope to optimize the number of connections (example: I'm limited to 5 connections to one host, whereas another host has a 3-connection limit, so I want to use as many as each host allows). Also, the set of files to download will change a little over time, so the input list should be easily configurable (in a config file or equivalent).
- Is it normal to perform batch downloads like this *before* running the MapReduce job?
- Or is it OK to include such steps in the job?
- It seems handy to keep the whole process as one neat package in Hadoop if possible.
- What class should I implement if I wanted to manage this myself? Would I just extend TextInputFormat, for example, and do the HTTP processing there? Or would I be creating a FileSystem?
Thanks, David
-
Re: Large input files via HTTP
Seetharam Venkatesh 2012-10-23, 00:27
One possible way is to first create a list of files as tuples <host:port, filePath>. Then use a map-only job with NLineInputFormat to pull each file.
Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job.
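A small sketch of the first option's input list, assuming one tab-separated `host:port<TAB>filePath` line per file so that NLineInputFormat can hand each map task its own line(s). The line layout and the helper names are assumptions for illustration, not something the thread specifies.

```python
from urllib.parse import urlsplit


def to_list_line(url):
    """Turn a URL into one 'host:port<TAB>filePath' line of the job's input list."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{parts.hostname}:{port}\t{path}"


def parse_list_line(line):
    """Inverse, for use inside the map task: recover (host, port, path)."""
    hostport, path = line.rstrip("\n").split("\t", 1)
    host, port = hostport.rsplit(":", 1)
    return host, int(port), path


# Hypothetical URL, for illustration only.
line = to_list_line("http://a.example.com/data/part1.gz")
host, port, path = parse_list_line(line)
```

Note that the tuple drops the scheme, so a real list would need another field (or a convention on the port) to distinguish http from https.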
Regards, Venkatesh
-
RE: Large input files via HTTP
David Parks 2012-10-23, 00:59
Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download) and then kicks off the downloads in multiple threads? This way I can easily manage the most efficient download pattern within the map job, and my output is emitted as key/value pairs straight to the reducer step.
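To make the idea concrete, here is one way such an XML file might look and be parsed. The schema (`downloads`, `host`, `max-connections`, `file`) is entirely hypothetical, as are the hosts and URLs; it is only a sketch of keeping the file list and the per-host limits in one configurable place.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema and contents, for illustration only.
SAMPLE = """
<downloads>
  <host name="a.example.com" max-connections="5">
    <file>http://a.example.com/data/part1.gz</file>
    <file>http://a.example.com/data/part2.gz</file>
  </host>
  <host name="b.example.com" max-connections="3">
    <file>http://b.example.com/dump.xml</file>
  </host>
</downloads>
"""


def load_download_plan(xml_text):
    """Return {hostname: (connection_limit, [urls])} from the XML text."""
    root = ET.fromstring(xml_text)
    plan = {}
    for host in root.findall("host"):
        # Default to a single connection when no limit is given.
        limit = int(host.get("max-connections", "1"))
        urls = [f.text.strip() for f in host.findall("file") if f.text]
        plan[host.get("name")] = (limit, urls)
    return plan


plan = load_download_plan(SAMPLE)
```

The limit for each host can then size that host's thread pool inside the map task.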
-
Re: Large input files via HTTP
Seetharam Venkatesh 2012-10-23, 06:09
Well, it depends. :-) If the XML cannot be split, then you'd end up with only one map task for the entire set of files. I think it'd make sense to have multiple splits so you get an even spread of copies across maps, retry only the failed copies, and don't have to manage the scheduling of the downloads yourself.
Look at DistCp for some intelligent splitting.
What are the constraints that you are working with?
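As an illustration of spreading the copy work evenly, the sketch below assigns files to a fixed number of splits by size, largest first, always adding to the lightest split. This is a simple greedy heuristic and not DistCp's actual algorithm; it also assumes the file sizes are known up front (for example from a HEAD request), and the file names and sizes are invented.

```python
import heapq


def balanced_splits(files, num_splits):
    """Greedy partition: place each file, largest first, into the lightest split."""
    heap = [(0, i) for i in range(num_splits)]  # (bytes assigned, split index)
    splits = [[] for _ in range(num_splits)]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, i = heapq.heappop(heap)
        splits[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return splits


# Hypothetical (name, size in GB) pairs, for illustration only.
files = [("a", 8), ("b", 7), ("c", 6), ("d", 5), ("e", 4)]
splits = balanced_splits(files, 2)
```

With each split handled by its own map task, a failed task only needs to re-copy the files in its own split.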