Re: Insight on why distcp becomes slower when adding nodemanager
Not sure.

Lots of things can affect your throughput.
Networking is my first guess, which is why I asked about the number of times you've run the same test, to see whether there is a wide variation in timings.
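
A repeated-run timing check of the kind being asked about might look like the sketch below; the bucket, credentials, and target path are placeholders, not the (redacted) ones used later in the thread.

# Run the same copy several times and compare wall-clock times; a wide spread
# would point at network variance rather than cluster configuration.
for i in 1 2 3 4 5; do
  HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce time hadoop distcp -overwrite \
      s3n://ACCESS_KEY:SECRET_KEY@some-bucket/dev/* hdfs:///tmp/distcp-test/run$i
done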

On Oct 31, 2012, at 7:37 AM, Alexandre Fouche <[EMAIL PROTECTED]> wrote:

> These instances have no swap. I tried 5 or 6 times in a row, and modified yarn.nodemanager.resource.memory-mb, but it did not help. Later on, I'll replace OpenJDK with Oracle Java SE 1.6.31 to see if it improves overall performance.
> Now I am running everything on medium instances for prototyping, and while this is better, I still find it excessively slow. Maybe bad Hadoop performance on instances smaller than xlarge is to be expected on EC2?
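
Two quick checks matching the points above (standard commands; the exact output will of course vary per instance):

free -m          # confirms whether any swap is configured on the instance
java -version    # shows whether OpenJDK or the Oracle JDK is currently on the PATH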
>
>
> --
> Alexandre Fouche
> Lead operations engineer, cloud architect
> http://www.cleverscale.com | @cleverscale
> Sent with Sparrow
>
> On Monday 29 October 2012 at 20:04, Michael Segel wrote:
>
>> How many times did you test it?
>>
>> Need to rule out aberrations.
>>
>> On Oct 29, 2012, at 11:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> On your second low-memory NM instance, did you ensure to lower the
>>> yarn.nodemanager.resource.memory-mb property specifically to avoid
>>> swapping due to excessive resource grants? The default offered is 8 GB
>>> (far more than the 1.7 GB you have).
>>>
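
For reference, checking and lowering that property on the small node could look roughly like the sketch below; the 1024 MB value and the /etc/hadoop/conf path are assumptions for illustration, not settings taken from this thread.

# Show the current NodeManager memory grant, if one is set at all
# (no entry means the 8 GB default applies):
grep -A1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml

# A yarn-site.xml entry along these lines would cap it on a 1.7 GB instance:
#   <property>
#     <name>yarn.nodemanager.resource.memory-mb</name>
#     <value>1024</value>
#   </property>

# Restart the NodeManager (service name as installed by the CDH packages
# listed further down) so the ResourceManager picks up the reduced capacity:
service hadoop-yarn-nodemanager restart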
>>> On Mon, Oct 29, 2012 at 8:42 PM, Alexandre Fouche
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> Can someone give some insight on why a "distcp" of 600 files of a few
>>>> hundred bytes each from s3n:// to local HDFS takes 46s when using one
>>>> yarn-nodemanager EC2 instance with 16 GB memory (which, by the way, I think
>>>> is laughably long), and 3min30s when adding a second yarn-nodemanager (a
>>>> small instance with 1.7 GB memory)?
>>>> I would have expected it to be a bit faster, not 5x longer!
>>>>
>>>> I have the same issue when I stop the small instance's nodemanager and
>>>> restart it so that it joins the processing after the job has already been
>>>> submitted to the big nodemanager instance.
>>>>
>>>> I am using Cloudera's latest YARN + HDFS on Amazon (a rebranded CentOS 6).
>>>>
>>>> #Staging 14:58:04 root@datanode2:hadoop-yarn: rpm -qa |grep hadoop
>>>> hadoop-hdfs-datanode-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-mapreduce-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-0.20-mapreduce-0.20.2+1261-1.cdh4.1.1.p0.4.el6.x86_64
>>>> hadoop-yarn-nodemanager-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-mapreduce-historyserver-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-hdfs-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-client-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>> hadoop-yarn-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>
>>>>
>>>> #Staging 14:39:51 root@resourcemanager:hadoop-yarn:
>>>> HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce time hadoop distcp -overwrite
>>>> s3n://xxx:[EMAIL PROTECTED]ev/* hdfs:///tmp/something/a
>>>>
>>>> 12/10/29 14:40:12 INFO tools.DistCp: Input Options:
>>>> DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false,
>>>> ignoreFailures=false, maxMaps=20, sslConfigurationFile='null',
>>>> copyStrategy='uniformsize', sourceFileListing=null,
>>>> sourcePaths=[s3n://xxx:[EMAIL PROTECTED]ev/*],
>>>> targetPath=hdfs:/tmp/something/a}
>>>> 12/10/29 14:40:18 WARN conf.Configuration: io.sort.mb is deprecated.
>>>> Instead, use mapreduce.task.io.sort.mb
>>>> 12/10/29 14:40:18 WARN conf.Configuration: io.sort.factor is deprecated.
>>>> Instead, use mapreduce.task.io.sort.factor
>>>> 12/10/29 14:40:19 INFO mapreduce.JobSubmitter: number of splits:15
>>>> 12/10/29 14:40:19 WARN conf.Configuration: mapred.jar is deprecated.
>>>> Instead, use mapreduce.job.jar
>>>> 12/10/29 14:40:19 WARN conf.Configuration:
>>>> mapred.map.tasks.speculative.execution is deprecated. Instead, use
>>>> mapreduce.map.speculative
>>>> 12/10/29 14:40:19 WARN conf.Configuration: mapred.reduce.tasks is
>>>> deprecated. Instead, use mapreduce.job.reduces
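
As an aside, the WARN lines above only flag old configuration key names that have newer equivalents; they are harmless. For completeness, a variant of the same command that passes two of the new-style keys explicitly and also caps the number of copy maps below the maxMaps=20 shown in the options dump might look like this (credentials, bucket, and the chosen values are placeholders, not taken from the thread):

HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce hadoop distcp \
  -Dmapreduce.task.io.sort.mb=100 \
  -Dmapreduce.task.io.sort.factor=10 \
  -m 5 \
  -overwrite s3n://ACCESS_KEY:SECRET_KEY@some-bucket/dev/* hdfs:///tmp/something/a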