Re: Remote connection bottleneck?
Mario:
Can you show us the error when you run the following?
"hadoop jar <route where I placed the file with the ssh connection>
<arguments>"
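
For reference, a minimal sketch of placing the jar on a cluster node and running it there; the user, host, jar name, main class, and HDFS paths are hypothetical placeholders, not taken from this thread:

    # copy the jar from your local machine to a cluster node (hypothetical user/host)
    $ scp myjob.jar mario@cluster-node1:~/

    # log in to that node and submit the job from inside the cluster
    $ ssh mario@cluster-node1
    $ hadoop jar ~/myjob.jar com.example.MyJob /user/mario/input /user/mario/output
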
On Sat, Sep 25, 2010 at 12:27 PM, Mario M <[EMAIL PROTECTED]> wrote:

>> Hello,
>> please excuse my ignorance, but how can I run it from there?
>> Up to now I've been running the programs with "hadoop jar <localfile>
>> <arguments>".
>>
>> I tried copying the jar to HDFS and using "hadoop jar <HDFS route>
>> <arguments>", but that didn't work (file not found), so I opened the ssh
>> connection and copied the jar to my directory on the server, but now I don't
>> know how to run it from there. "hadoop jar <route where I placed the file
>> with the ssh connection>" didn't work.
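
A likely cause of the "file not found": "hadoop jar" expects a jar on the local filesystem of the machine where the command is run, not an HDFS path. A rough sketch, with hypothetical paths, of pulling a jar out of HDFS on the server and running it from a local path:

    # on the server reached over ssh: copy the jar from HDFS to the local filesystem
    $ hadoop fs -get /user/mario/myjob.jar .

    # then point hadoop jar at the local copy
    $ hadoop jar ./myjob.jar com.example.MyJob /user/mario/input /user/mario/output
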
>>
>> I am not very experienced with ssh, so I am sorry if this is basic stuff.
>>
>> Thanks,
>>
>> Mario Maqueo
>> ITESM-CEM
>>
>> 2010/9/25 Ted Yu <[EMAIL PROTECTED]>
>>
>> Mario:
>>> Please produce a jar, place it on one of the servers in the cloud, and run
>>> it from there.
>>>
>>>
>>> On Sat, Sep 25, 2010 at 7:46 AM, Raja Thiruvathuru <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> MapReduce doesn't download the actual data; it only reads metadata
>>>> before it starts the MapReduce job.
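
The block metadata that ordinary block-based split calculation relies on can be inspected without reading the data itself, for example (hypothetical path):

    # lists a file's blocks and their locations using only NameNode metadata
    $ hadoop fsck /user/mario/input.txt -files -blocks -locations
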
>>>>
>>>>
>>>> On Sat, Sep 25, 2010 at 7:55 AM, Mario M <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>> I am having a problem that might be expected behaviour. I am using a
>>>>> Hadoop cloud remotely through ssh. I have a program that runs for about
>>>>> a minute; it processes a 200 MB file using NLineInputFormat, and the user
>>>>> decides the number of lines used to divide the file. However, before the
>>>>> map-reduce phase starts, the part of the job that divides the input runs
>>>>> locally on my computer. With a 100 Mbps connection to the cloud this isn't
>>>>> much of a problem, but at home, over a 1 Mbps connection, the program takes
>>>>> about 30 minutes or more just to process this input. Apparently it is
>>>>> downloading the full 200 MB, processing it to decide the byte offsets for
>>>>> dividing the file, and sending that to the cloud.
>>>>>
>>>>> This 30-minute startup time kills all the advantages of using
>>>>> MapReduce for us. My question is: is this expected behaviour? Is the
>>>>> InputFormat phase of the program supposed to run locally and not in the
>>>>> cloud, or am I doing something wrong? By contrast, I ran the terasort
>>>>> Hadoop example on 100 GB and it took 3-4 minutes of startup before the
>>>>> map phase began, which clearly shows that it isn't downloading all the
>>>>> information. Terasort doesn't use NLineInputFormat, but it still has to
>>>>> read the files to divide them, doesn't it?
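
The timing is consistent with the whole file crossing the slow link during split calculation: 200 MB is roughly 1600 megabits, which takes about 27 minutes at 1 Mbps, close to the observed 30 minutes. NLineInputFormat has to read the file to find line boundaries, and input splits are computed by the job client at submission time, whereas block-based formats such as the one terasort uses only need block metadata from the NameNode (see the fsck example earlier in the thread). A rough way to measure the download cost from a remote client, with a hypothetical path:

    # streams the whole file through the client, comparable to what NLineInputFormat's split calculation must do
    $ time hadoop fs -cat /user/mario/input.txt | wc -c
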
>>>>>
>>>>> Thank you in advance for your time. :)
>>>>>
>>>>> Mario Maqueo
>>>>> ITESM-CEM
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Raja Thiruvathuru
>>>>
>>>
>>>
>>