RE: Large input files via HTTP
I might very well be overthinking this, but I have a cluster I'm firing up
on EC2 that I want to keep utilized. I have some other unrelated jobs that
don't need to wait for the downloads, so I don't want to fill all the map
slots with long-running downloads. I'd rather the other jobs run in parallel
while the downloads are happening.

 

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 1:10 PM
To: [EMAIL PROTECTED]
Subject: Re: Large input files via HTTP

 

Well, it depends. :-)  If the XML cannot be split, then you'd end up with
only one map task for the entire set of files. I think it'd make sense to
have multiple splits so you can get an even spread of the copies across maps,
retry only the failed copies, and avoid having to manage the scheduling of
the downloads yourself.

 

Look at DistCp for some intelligent splitting.

 

What are the constraints that you are working with?

On Mon, Oct 22, 2012 at 5:59 PM, David Parks <[EMAIL PROTECTED]> wrote:

Would it make sense to write a map job that takes an unsplittable XML file
(which defines all of the files I need to download), and have that one map
task kick off the downloads in multiple threads? That way I can easily manage
the most efficient download pattern within the map job, and my output is
emitted as key/value pairs straight to the reducer step.
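
A minimal sketch of that idea, assuming the XML has already been flattened to
one URL per input line (the XML parsing is elided) and using a hypothetical
DownloadMapper class. Downloads run on a small thread pool; results are
emitted from the task thread in cleanup(), since Context.write is not
thread-safe:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DownloadMapper extends Mapper<LongWritable, Text, Text, Text> {

  private ExecutorService pool;
  private Map<String, Future<String>> pending;

  @Override
  protected void setup(Context context) {
    // Assumption: 5 parallel connections; tune to the per-host limits.
    pool = Executors.newFixedThreadPool(5);
    pending = new LinkedHashMap<String, Future<String>>();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    final String url = value.toString().trim();
    if (url.isEmpty()) return;
    // Queue the download; the worker thread only counts bytes here.
    pending.put(url, pool.submit(new Callable<String>() {
      public String call() throws Exception {
        InputStream in = new URL(url).openStream();
        try {
          byte[] buf = new byte[64 * 1024];
          long bytes = 0;
          for (int n; (n = in.read(buf)) != -1; ) bytes += n;  // real job: copy to HDFS
          return Long.toString(bytes);
        } finally {
          in.close();
        }
      }
    }));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Drain results on the single task thread and emit one record per URL.
    for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
      try {
        context.write(new Text(e.getKey()), new Text(e.getValue().get()));
      } catch (ExecutionException ex) {
        context.write(new Text(e.getKey()), new Text("FAILED: " + ex.getCause()));
      }
    }
    pool.shutdown();
  }
}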

 

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 7:28 AM
To: [EMAIL PROTECTED]
Subject: Re: Large input files via HTTP

 

One possible way is to first create a list of files as <host:port, filePath>
tuples, one per line. Then use a map-only job to pull each file using
NLineInputFormat.
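
For illustration, a minimal map-only driver along those lines, assuming the
Hadoop 2.x mapreduce API and a hypothetical FetchJob class (the mapper is the
DownloadMapper sketched above; args[0] is the URL list, args[1] the output
directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FetchJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "http-fetch");
    job.setJarByClass(FetchJob.class);

    job.setMapperClass(DownloadMapper.class);
    job.setNumReduceTasks(0);                       // map-only
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 10);  // 10 files to fetch per map task
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // the file list
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Splitting the list N lines at a time keeps every map task short, so a failed
download only re-runs a handful of fetches rather than the whole batch.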

 

Another way is to write an HttpInputFormat and HttpRecordReader and stream
the data in a map-only job.
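
A rough sketch of the record-reader half of that approach, assuming a
companion HttpInputFormat supplies one split per URL (here the URL is carried
in a FileSplit path purely for brevity); each record is a <url, chunk> pair:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Streams one HTTP resource as a sequence of <url, chunk> records. */
public class HttpRecordReader extends RecordReader<Text, BytesWritable> {

  private static final int CHUNK_SIZE = 4 * 1024 * 1024;  // 4 MB per record

  private InputStream in;
  private Text key = new Text();
  private BytesWritable value = new BytesWritable();
  private boolean done = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    // Assumption: the split's path holds the HTTP URL to fetch.
    String url = ((FileSplit) split).getPath().toString();
    key.set(url);
    in = new URL(url).openStream();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (done) return false;
    byte[] buf = new byte[CHUNK_SIZE];
    int off = 0, n;
    // Fill the chunk buffer until it is full or the stream ends.
    while (off < buf.length && (n = in.read(buf, off, buf.length - off)) != -1) {
      off += n;
    }
    if (off == 0) {            // nothing left to read
      done = true;
      return false;
    }
    value.set(buf, 0, off);
    done = off < buf.length;   // a short read means we hit end of stream
    return true;
  }

  @Override public Text getCurrentKey() { return key; }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return done ? 1.0f : 0.0f; }
  @Override public void close() throws IOException { if (in != null) in.close(); }
}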

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[EMAIL PROTECTED]> wrote:

I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources and processes them nightly.

Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours, and I'd hope I can
optimize the number of connections (for example, I'm limited to 5 connections
to one host, whereas another host has a 3-connection limit, so I want to
maximize each as much as possible). Also, the set of files to download will
change a little over time, so the input list should be easily configurable
(in a config file or equivalent).

 - Is it normal to perform batch downloads like this *before* running the
mapreduce job?
 - Or is it OK to include such steps in the job?
 - It seems handy to keep the whole process as one neat package in Hadoop if
possible.
 - What class should I implement if I wanted to manage this myself? Would I
just extend TextInputFormat, for example, and do the HTTP processing there?
Or am I creating a FileSystem?

Thanks,
David

 

--
Regards,
Venkatesh

 

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”

- Antoine de Saint-Exupéry