
Re: Insight on why distcp becomes slower when adding nodemanager
Not sure.

Lots of things can affect your throughput.
Networking is my first guess, which is why I asked how many times you've run the same test, to see whether there is wide variation in the timings.
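A minimal sketch of such a repeatability check, reusing the distcp invocation quoted later in this thread (the s3n credentials, bucket, and target paths here are placeholders, not real values):

for i in 1 2 3 4 5; do
  # Wide run-to-run variation in wall-clock time would point at
  # networking rather than at the cluster configuration.
  time hadoop distcp -overwrite \
    s3n://ACCESS_KEY:SECRET_KEY@some-bucket/dev/* hdfs:///tmp/distcp-test-$i
done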

On Oct 31, 2012, at 7:37 AM, Alexandre Fouche <[EMAIL PROTECTED]> wrote:

> These instances have no swap. I tried 5 or 6 times in a row, and modified yarn.nodemanager.resource.memory-mb, but it did not help. Later on, I'll replace OpenJDK with Oracle Java SE 1.6.31 to see if it improves overall performance.
> Now I am running everything on medium instances for prototyping, and while this is better, I still find it excessively slow. Maybe poor Hadoop performance on instances smaller than xlarge is to be expected on EC2?
>
>
> --
> Alexandre Fouche
> Lead operations engineer, cloud architect
> http://www.cleverscale.com | @cleverscale
> Sent with Sparrow
>
> On Monday 29 October 2012 at 20:04, Michael Segel wrote:
>
>> How many times did you test it?
>>
>> We need to rule out aberrations.
>>
>> On Oct 29, 2012, at 11:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> On your second low-memory NM instance, did you make sure to lower the
>>> yarn.nodemanager.resource.memory-mb property, specifically to avoid
>>> swapping due to excessive resource grants? The default offered is 8 GB,
>>> far more than the 1.7 GB you have.
>>>
>>> On Mon, Oct 29, 2012 at 8:42 PM, Alexandre Fouche
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> Can someone give some insight on why a "distcp" of 600 files of a few
>>>> hundred bytes each from s3n:// to local HDFS takes 46s when using one
>>>> yarn-nodemanager EC2 instance with 16GB memory (which, by the way, I
>>>> think is absurdly long), and 3min30s when adding a second
>>>> yarn-nodemanager (a small instance with 1.7GB memory)?
>>>> I would have expected it to be a bit faster, not 5x longer!
>>>>
>>>> I have the same issue when I stop the small instance's nodemanager and
>>>> restart it so that it joins the processing after the job had already
>>>> been submitted to the big nodemanager instance.
>>>>
>>>> I am using the latest Cloudera YARN+HDFS on Amazon (a rebranded CentOS 6)
>>>>
>>>> #Staging 14:58:04 root@datanode2:hadoop-yarn: rpm -qa |grep hadoop
>>>> hadoop-hdfs-datanode-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-mapreduce-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-0.20-mapreduce-0.20.2+1261-1.cdh4.1.1.p0.4.el6.x86_64
>>>> hadoop-yarn-nodemanager-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-mapreduce-historyserver-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-hdfs-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-client-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-yarn-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>
>>>>
>>>> #Staging 14:39:51 root@resourcemanager:hadoop-yarn:
>>>> HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce time hadoop distcp -overwrite
>>>> s3n://xxx:[EMAIL PROTECTED]ev/* hdfs:///tmp/something/a
>>>>
>>>> 12/10/29 14:40:12 INFO tools.DistCp: Input Options:
>>>> DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false,
>>>> ignoreFailures=false, maxMaps=20, sslConfigurationFile='null',
>>>> copyStrategy='uniformsize', sourceFileListing=null,
>>>> sourcePaths=[s3n://xxx:[EMAIL PROTECTED]ev/*],
>>>> targetPath=hdfs:/tmp/something/a}
>>>> 12/10/29 14:40:18 WARN conf.Configuration: io.sort.mb is deprecated.
>>>> Instead, use mapreduce.task.io.sort.mb
>>>> 12/10/29 14:40:18 WARN conf.Configuration: io.sort.factor is deprecated.
>>>> Instead, use mapreduce.task.io.sort.factor
>>>> 12/10/29 14:40:19 INFO mapreduce.JobSubmitter: number of splits:15
>>>> 12/10/29 14:40:19 WARN conf.Configuration: mapred.jar is deprecated.
>>>> Instead, use mapreduce.job.jar
>>>> 12/10/29 14:40:19 WARN conf.Configuration:
>>>> mapred.map.tasks.speculative.execution is deprecated. Instead, use
>>>> mapreduce.map.speculative
>>>> 12/10/29 14:40:19 WARN conf.Configuration: mapred.reduce.tasks is
>>>> deprecated. Instead, use mapreduce.job.reduces