

Re: Remote connection bottleneck?
Mario:
Please produce a jar, place it on one of the servers in the cloud and run
from there.
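
For background, here is a minimal sketch of why running from inside the cloud should help. As far as I know, input splits are computed by the program that submits the job, and a line-based format has to scan the input to find where every Nth line ends. The Java below is a hypothetical stand-in for that scan (`LineSplitSketch` and `computeSplits` are illustrative names, not Hadoop's actual code); it only assumes that the input arrives as a byte stream and that lines end in `\n`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a line-based split computation; not Hadoop's code. */
class LineSplitSketch {

    /** Returns {start, length} byte ranges, each covering up to linesPerSplit lines. */
    static List<long[]> computeSplits(InputStream in, int linesPerSplit) {
        List<long[]> splits = new ArrayList<>();
        long pos = 0;    // bytes consumed so far
        long start = 0;  // byte offset where the current split begins
        int lines = 0;   // complete lines seen in the current split
        try {
            int b;
            // Every byte has to be read to find the newlines, so the whole
            // file crosses whatever link sits between the reader and the data.
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n' && ++lines == linesPerSplit) {
                    splits.add(new long[] {start, pos - start});
                    start = pos;
                    lines = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pos > start) {
            // Trailing lines that did not fill a whole split.
            splits.add(new long[] {start, pos - start});
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\ndddd\neeeee\n".getBytes(StandardCharsets.US_ASCII);
        for (long[] s : computeSplits(new ByteArrayInputStream(data), 2)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With 2 lines per split, the sample input yields three ranges: start=0 length=5, start=5 length=9, start=14 length=6. Since every byte passes through the loop, a 200 MB file over a 1 Mbps link needs roughly 200 x 8 = 1600 seconds, about 27 minutes, which is in line with the delay reported below. Run on a machine inside the cluster, the same scan reads over the local network instead.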

On Sat, Sep 25, 2010 at 7:46 AM, Raja Thiruvathuru
<[EMAIL PROTECTED]> wrote:

> MapReduce doesn't download the actual data, but it reads metadata before
> it starts the MapReduce job.
>
>
> On Sat, Sep 25, 2010 at 7:55 AM, Mario M <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>> I am having a problem that might be expected behaviour. I am using a cloud
>> with Hadoop remotely through ssh. My program runs for about a minute: it
>> processes a 200 MB file using NLineInputFormat, and the user chooses the
>> number of lines used to divide the file. However, before the map-reduce
>> phase starts, the part of the program that divides the input runs locally
>> on my computer. Over a 100 Mbps connection to the cloud this isn't much of
>> a problem, but at home, on a 1 Mbps connection, the program takes about 30
>> minutes or more to process this input. Apparently it downloads the full
>> 200 MB, scans it to work out the byte offsets for dividing the file, and
>> sends those offsets to the cloud.
>>
>> This 30-minute startup time kills all the advantages of using MapReduce
>> for us. My question is: is this expected behaviour? Is the InputFormat phase
>> of the program supposed to run locally and not in the cloud? Or am I doing
>> something wrong? As a contrast, I ran the terasort Hadoop example on 100
>> GB; it took 3-4 minutes to start up before the map phase began, which
>> clearly shows that it isn't downloading all the data. Terasort doesn't use
>> NLineInputFormat, but doesn't it still have to read the files to divide
>> them?
>>
>> Thank you in advance for your time. :)
>>
>> Mario Maqueo
>> ITESM-CEM
>>
>
>
>
> --
>
> Raja Thiruvathuru
>