RE: Large input files via HTTP
David Parks 2012-10-24, 04:06
I might very well be overthinking this. But I have a cluster I'm firing up on EC2 that I want to keep utilized. I have some other unrelated jobs that don't need to wait for the downloads, so I don't want to fill all the map slots with long downloads. I'd rather the other jobs run in parallel while the downloads are happening.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 1:10 PM
To: [EMAIL PROTECTED]
Subject: Re: Large input files via HTTP

Well, it depends. :-) If the XML cannot be split, then you'd end up with only one map task for the entire set of files. I think it'd make sense to have multiple splits so you can get an even spread of copies across maps, retry only the failed copy, and avoid managing the scheduling of the downloads yourself. Look at DistCp for some intelligent splitting.

What are the constraints that you are working with?

On Mon, Oct 22, 2012 at 5:59 PM, David Parks <[EMAIL PROTECTED]> wrote:

Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download), so that the one map task then kicks off the downloads in multiple threads? This way I can easily manage the most efficient download pattern within the map job, and my output is emitted as key/value pairs straight to the reducer step.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 7:28 AM
To: [EMAIL PROTECTED]
Subject: Re: Large input files via HTTP

One possible way is to first create a list of files as tuples <host:port, filePath>. Then use a map-only job to pull each file using NLineInputFormat.

Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job.

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[EMAIL PROTECTED]> wrote:

I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly.

Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours, and I'd hope I can optimize the number of connections (example: I'm limited to 5 connections to one host, whereas another host has a 3-connection limit, so I want to use as many of them as possible). Also, the set of files to download will change a little over time, so the input list should be easily configurable (in a config file or equivalent).

- Is it normal to perform batch downloads like this *before* running the MapReduce job?
- Or is it OK to include such steps in the job?
- It seems handy to keep the whole process as one neat package in Hadoop if possible.
- What class should I implement if I wanted to manage this myself? Would I just extend TextInputFormat, for example, and do the HTTP processing there? Or would I create a FileSystem?

Thanks,
David

--
Regards,
Venkatesh

Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.
- Antoine de Saint-Exupéry
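
For readers following the thread, the per-host connection limits from the original question (5 connections to one host, 3 to another) could be handled roughly as follows. This is only a minimal sketch in plain Java, not Hadoop code and not something anyone in the thread posted. The class `HostLimitedDownloader`, the method `fetchAll`, the `example.com` URLs, and the default limit of 1 are all hypothetical, and the actual HTTP download is left as a caller-supplied callback. It assumes well-formed URLs and does no retrying.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper (not part of Hadoop): caps concurrent downloads per host
// by giving each host its own fixed-size thread pool.
class HostLimitedDownloader {

    // Calls fetch once per URL and blocks until all calls have finished.
    // limits maps a host name to its maximum number of simultaneous connections;
    // hosts missing from the map get defaultLimit.
    static void fetchAll(List<String> urls, Map<String, Integer> limits,
                         int defaultLimit, Consumer<String> fetch) {
        Map<String, ExecutorService> pools = new HashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost(); // assumes a well-formed URL
            ExecutorService pool = pools.computeIfAbsent(host,
                    h -> Executors.newFixedThreadPool(limits.getOrDefault(h, defaultLimit)));
            // Exceptions thrown by fetch are captured in the ignored Future;
            // a real job would need to collect and retry failures.
            pool.submit(() -> fetch.accept(url));
        }
        for (ExecutorService pool : pools.values()) {
            pool.shutdown(); // no new tasks; queued downloads still run
        }
        try {
            for (ExecutorService pool : pools.values()) {
                pool.awaitTermination(1, TimeUnit.DAYS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Hypothetical hosts and limits, mirroring the 5 and 3 from the question.
        Map<String, Integer> limits = Map.of("a.example.com", 5, "b.example.com", 3);
        List<String> urls = List.of(
                "http://a.example.com/part1.xml",
                "http://a.example.com/part2.xml",
                "http://b.example.com/part1.xml");
        // Stand-in for the real download; print order may vary between runs.
        fetchAll(urls, limits, 1, url -> System.out.println("would download " + url));
    }
}
```

Something like this could run inside a single map task, as in the multi-threaded idea from the 5:59 PM message, with the trade-offs Venkatesh points out: one task, and retry and scheduling handled by your own code rather than by the framework.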