MapReduce, mail # user - Large input files via HTTP


RE: Large input files via HTTP
David Parks 2012-10-23, 00:59
Would it make sense to write a map job that takes an unsplittable XML file
(which defines all of the files I need to download) as its input? That one
map job would then kick off the downloads in multiple threads. This way I
could easily manage the most efficient download pattern within the map job,
and my output would be emitted as key/value pairs straight to the reduce step.
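
A minimal sketch of the multi-threaded download idea, using only the JDK and
no Hadoop classes. The class name HostLimitedFetcher, its methods, and the
"fetch" callback are hypothetical and chosen for illustration; inside a real
map task the callback would open the HTTP stream and emit records. It assumes
every URI is absolute, so that getHost() is non-null.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: runs fetches in parallel, capping concurrency per host. */
public class HostLimitedFetcher {
    private final Map<String, Integer> limits;   // host -> max simultaneous connections
    private final int defaultLimit;              // used for hosts not listed in limits
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public HostLimitedFetcher(Map<String, Integer> limits, int defaultLimit) {
        this.limits = limits;
        this.defaultLimit = defaultLimit;
    }

    // One semaphore per host, created lazily with that host's connection limit.
    private Semaphore permitsFor(String host) {
        return permits.computeIfAbsent(
            host, h -> new Semaphore(limits.getOrDefault(h, defaultLimit)));
    }

    /** Blocks until every URI has been handed to the fetch callback. */
    public void fetchAll(List<URI> uris, int threads, Consumer<URI> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (URI uri : uris) {
            pool.submit(() -> {
                Semaphore hostPermits = permitsFor(uri.getHost());
                hostPermits.acquireUninterruptibly();   // wait for a free slot on this host
                try {
                    fetch.accept(uri);
                } finally {
                    hostPermits.release();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that a thread waiting on a busy host still occupies a pool slot, and
exceptions thrown by the callback are not reported here; a production version
would need to handle both.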

 

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 7:28 AM
To: [EMAIL PROTECTED]
Subject: Re: Large input files via HTTP

 

One possible way is to first create a list of files as <host:port, filePath>
tuples, one per line. Then use a map-only job with NLineInputFormat to pull
each file.
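
As a rough illustration of the list-of-tuples approach: with NLineInputFormat
each map call receives one line of the list as its value, so the mapper mainly
needs to turn that line into something it can open. The sketch below assumes a
tab-separated "host:port<TAB>filePath" layout and plain http; both the layout
and the class name FileListEntry are assumptions, not something the thread
specifies.

```java
import java.net.URI;

/** Hypothetical parser for one line of the download list. */
public class FileListEntry {

    /** Converts "host:port<TAB>filePath" into an http URI. */
    public static URI toUri(String line) {
        String[] parts = line.trim().split("\t", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "expected host:port<TAB>filePath but got: " + line);
        }
        String hostPort = parts[0].trim();
        String path = parts[1].trim();
        if (!path.startsWith("/")) {
            path = "/" + path;   // tolerate paths written without a leading slash
        }
        // URI.create throws IllegalArgumentException on illegal characters.
        return URI.create("http://" + hostPort + path);
    }
}
```

A mapper would call FileListEntry.toUri(value.toString()) and then stream the
response from that address.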

 

Another way is to write a custom HttpInputFormat and HttpRecordReader and
stream the data in a map-only job.

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[EMAIL PROTECTED]> wrote:

I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.

Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours, and I'd hope I
can optimize the number of connections (for example, I'm limited to 5
connections to one host, whereas another host has a 3-connection limit, so I
want to use as many as each host allows). Also, the set of files to download
will change a little over time, so the input list should be easily
configurable (in a config file or equivalent).
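
One way the per-host limits could live in a config file is a plain Java
properties file with one "host=limit" entry per line. This is only a sketch;
the file layout and the class name HostLimits are assumptions made for
illustration. Hosts are written without a port, because ':' is also a
key/value separator in the properties format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical loader for per-host connection limits. */
public class HostLimits {

    /** Parses properties-style text such as "files.example.com=5". */
    public static Map<String, Integer> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Integer> limits = new TreeMap<>();
        for (String host : props.stringPropertyNames()) {
            // NumberFormatException here means the limit was not an integer.
            limits.put(host, Integer.parseInt(props.getProperty(host).trim()));
        }
        return limits;
    }
}
```

For example, a file containing "a.example=5" and "b.example=3" on separate
lines would yield a map with limits 5 and 3 for those two hosts.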

 - Is it normal to perform batch downloads like this *before* running the
mapreduce job?
 - Or is it ok to include such steps in with the job?
 - It seems handy to keep the whole process as one neat package in Hadoop if
possible.
 - What class should I implement if I wanted to manage this myself? Would I
just extend TextInputFormat, for example, and do the HTTP processing there?
Or would I be creating a FileSystem?

Thanks,
David

 

--
Regards,
Venkatesh

 

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”

- Antoine de Saint-Exupéry