MapReduce >> mail # user >> Remote connection bottleneck?


Re: Remote connection bottleneck?
Mario:
Can you show us the error when you run the following?
"hadoop jar <path where I placed the file over the ssh connection>
<arguments>"
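A note for the archive: "hadoop jar" expects a jar on the local filesystem of the machine where the command runs, not a path on HDFS, which matches the "file not found" below. A rough sketch of what Ted is suggesting, where the hostname, user name, jar name, and paths are all placeholders:

```shell
# Copy the job jar from your own machine to a node inside the cluster
# (names and paths here are only examples).
scp wordcount.jar user@cluster-node.example.com:/home/user/

# Log in to that node and submit the job from there, so job setup
# (including input splitting) happens next to the data rather than
# over a slow home connection.
ssh user@cluster-node.example.com
hadoop jar /home/user/wordcount.jar com.example.WordCount <arguments>
```

The input and output paths passed in <arguments> still refer to HDFS; only the jar itself has to be a local file on the submitting machine.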
>> Hello,
>> please excuse my ignorance, but how can I run it from there?
>> Up to now I've been running the programs with "hadoop jar <localfile>
>> <arguments>".
>>
>> I tried copying the jar to HDFS and using "hadoop jar <HDFS path>
>> <arguments>", but that didn't work (file not found), so I opened the ssh
>> connection and copied the jar to my directory there, but now I don't know
>> how to run it from there. "hadoop jar <path where I placed the file over
>> the ssh connection>" didn't work either.
>>
>> I am not very experienced with ssh, so I am sorry if this is basic stuff.
>>
>> Thanks,
>>
>> Mario Maqueo
>> ITESM-CEM
>>
>> 2010/9/25 Ted Yu <[EMAIL PROTECTED]>
>>
>> Mario:
>>> Please produce a jar, place it on one of the servers in the cloud and run
>>> from there.
>>>
>>>
>>> On Sat, Sep 25, 2010 at 7:46 AM, Raja Thiruvathuru <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> MapReduce doesn't download the actual data; it only reads the metadata
>>>> before it starts the MapReduce job.
>>>>
>>>>
>>>> On Sat, Sep 25, 2010 at 7:55 AM, Mario M <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>> I am having a problem that might be expected behaviour. I am using a
>>>>> Hadoop cloud remotely through ssh. I have a program that runs for about
>>>>> a minute; it processes a 200 MB file using NLineInputFormat, and the
>>>>> user chooses the number of lines per split. However, before the
>>>>> map-reduce phase starts, the part of the job that divides the input runs
>>>>> locally on my computer. Over a 100 Mbps connection to the cloud this
>>>>> isn't much of a problem, but from my house over a 1 Mbps connection the
>>>>> program takes 30 minutes or more on this step. Apparently it downloads
>>>>> the full 200 MB, processes it to decide the byte offsets for dividing
>>>>> the file, and then sends those to the cloud.
>>>>>
>>>>> This 30-minute startup time kills all the advantages of using
>>>>> mapreduce for us. My question is: is this expected behaviour? Is the
>>>>> InputFormat phase of the program supposed to run locally rather than in
>>>>> the cloud? Or am I doing something wrong? As a contrast, I ran the
>>>>> terasort Hadoop example on 100 GB; it took 3-4 minutes of startup and
>>>>> then began the map phase, which clearly shows it isn't downloading all
>>>>> the data. Terasort doesn't use NLineInputFormat, but it still has to
>>>>> read the files to divide them, doesn't it?
>>>>>
>>>>> Thank you in advance for your time. :)
>>>>>
>>>>> Mario Maqueo
>>>>> ITESM-CEM
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Raja Thiruvathuru
>>>>
>>>
>>>
>>
>
On Sat, Sep 25, 2010 at 12:27 PM, Mario M <[EMAIL PROTECTED]> wrote:
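To the underlying question: as I understand it, NLineInputFormat computes its splits on the submitting client, and to split every N lines it has to read the whole input to find newline offsets, whereas size-based splitting (what terasort's input format effectively does) needs only the file length from the NameNode's metadata. A toy illustration of that difference; this is a simulation I wrote for the list, not Hadoop's actual code:

```python
# Toy simulation (not Hadoop's real code) of why line-count splitting is
# expensive on the client while size-based splitting is nearly free.

def n_line_splits(data: bytes, n: int):
    """Byte ranges covering n lines each -- requires scanning every byte."""
    offsets = [0]
    pos = 0
    count = 0
    while True:
        nl = data.find(b"\n", pos)
        if nl == -1:
            break
        pos = nl + 1
        count += 1
        if count % n == 0:
            offsets.append(pos)
    if offsets[-1] != len(data):
        offsets.append(len(data))  # trailing partial split
    return list(zip(offsets, offsets[1:]))

def size_splits(length: int, block: int):
    """Byte ranges of about `block` bytes -- needs only the file length."""
    return [(s, min(s + block, length)) for s in range(0, length, block)]

if __name__ == "__main__":
    data = b"line1\nline2\nline3\nline4\nline5\n"
    # Needs all 30 bytes of data to find the line boundaries:
    print(n_line_splits(data, 2))      # -> [(0, 12), (12, 24), (24, 30)]
    # Needs only the number 30, i.e. pure metadata:
    print(size_splits(len(data), 12))  # -> [(0, 12), (12, 24), (24, 30)]
```

So with a 200 MB input and NLineInputFormat, the client really does pull the whole file over your home link during job submission, which is why submitting from a machine inside the cluster (as suggested above) avoids the 30-minute startup.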