Re: Remote connection bottleneck?
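Please build a jar, copy it to one of the servers in the cloud, and run the
job from there. That way the input splits are computed next to the data
instead of over your home connection.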
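
As a rough sanity check on the numbers in your message, here is a back-of-the-envelope estimate. It assumes the whole 200 MB file is pulled over the client's link while the splits are computed, and it ignores protocol overhead and latency; the function name is only illustrative.

```python
def transfer_seconds(size_megabytes: float, link_megabits_per_s: float) -> float:
    """Idealized time to move a file over a link (no overhead, no latency)."""
    # 1 byte = 8 bits, so megabytes * 8 gives megabits.
    return size_megabytes * 8 / link_megabits_per_s


home = transfer_seconds(200, 1)      # 200 MB over a 1 Mbps link
office = transfer_seconds(200, 100)  # 200 MB over a 100 Mbps link

print(f"1 Mbps:   {home / 60:.1f} min")   # 1 Mbps:   26.7 min
print(f"100 Mbps: {office:.0f} s")        # 100 Mbps: 16 s
```

The 1 Mbps figure is close to the 30 minutes you observe, which is consistent with the full file crossing your connection before the job starts.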

On Sat, Sep 25, 2010 at 7:46 AM, Raja Thiruvathuru wrote:

> MapReduce doesn't download the actual data, but it does read metadata
> before it starts the MapReduce job.
> On Sat, Sep 25, 2010 at 7:55 AM, Mario M <[EMAIL PROTECTED]> wrote:
>> Hello,
>> I am having a problem that might be expected behaviour. I am using a
>> Hadoop cloud remotely through ssh. I have a program that runs for about a
>> minute; it processes a 200 MB file using NLineInputFormat, and the user
>> decides the number of lines per split. However, before the map-reduce
>> phase starts, the part of the program that divides the input runs locally
>> on my computer. With a 100 Mbps connection to the cloud that isn't much of
>> a problem, but at home, with a 1 Mbps connection, the program takes about
>> 30 minutes or more to process this input. Apparently it is downloading the
>> full 200 MB, processing it to decide the byte offsets for dividing the
>> file, and sending those offsets to the cloud.
>> This 30-minute startup time kills all the advantages of using mapreduce
>> for us. My question is: is this expected behaviour? Is the InputFormat
>> phase of the program supposed to run locally and not in the cloud? Or am I
>> doing something wrong? By contrast, I ran the terasort Hadoop example on
>> 100 GB; it took 3-4 minutes of startup and then started the map phase,
>> which clearly shows that it isn't downloading all the data. Terasort
>> doesn't use NLineInputFormat, but doesn't it still have to read the files
>> to divide them?
>> Thank you in advance for your time. :)
>> Mario Maqueo
> --
> Raja Thiruvathuru
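
On the question quoted above about why the whole file seems to be read: splitting by line count needs the byte offset of every N-th line, and those offsets can only be found by scanning the file contents. Block-based splitting generally needs only file sizes and block locations, which is metadata. The sketch below is not Hadoop's code; it is a minimal Python illustration of the line-count case, and `n_line_splits` is a made-up name.

```python
import io
from typing import BinaryIO, List, Tuple


def n_line_splits(stream: BinaryIO, lines_per_split: int) -> List[Tuple[int, int]]:
    """Return (start_offset, length) pairs, each covering up to lines_per_split lines."""
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be at least 1")
    splits = []
    start = 0   # byte offset where the current split begins
    pos = 0     # bytes consumed so far
    count = 0   # lines seen in the current split
    for line in stream:          # iterating touches every byte of the input
        pos += len(line)
        count += 1
        if count == lines_per_split:
            splits.append((start, pos - start))
            start = pos
            count = 0
    if count:                    # trailing split with fewer lines
        splits.append((start, pos - start))
    return splits


data = b"a\nbb\nccc\ndddd\neeeee\n"
print(n_line_splits(io.BytesIO(data), 2))  # [(0, 5), (5, 9), (14, 6)]
```

Whichever machine runs this scan has to read the entire input, so running it on a node inside the cluster keeps that traffic off a slow client link.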