Re: Insight on why distcp becomes slower when adding nodemanager
Marcos Ortiz 2012-10-31, 20:27
On 10/31/2012 02:23 PM, Michael Segel wrote:
> Not sure.
>
> Lots of things can affect your throughput.
> Networking is my first guess. Which is why I asked about the number of
> times you've run the same test to see if there is a wide variation in
> timings.
>
> On Oct 31, 2012, at 7:37 AM, Alexandre Fouche <[EMAIL PROTECTED]> wrote:
>
>> These instances have no swap. I tried 5 or 6 times in a row, and
>> modified the yarn.nodemanager.resource.memory-mb but it did not help.
>> Later on, I'll replace the OpenJDK with the Oracle Java SE 1.6.31 to
>> see if it improves overall performance.

How much RAM do you have, and how much of it is assigned to your Hadoop
services?

>> Now I am running everything on medium instances for prototyping, and
>> while this is better, I still find it abusively slow. Maybe bad
>> Hadoop performance on less than xlarge memory instances is to be
>> expected on EC2?

Are you using Hadoop on top of EC2 or are you using the EMR service?

>> --
>> Alexandre Fouche
>> Lead operations engineer, cloud architect
>> http://www.cleverscale.com | @cleverscale
>>
>> On Monday 29 October 2012 at 20:04, Michael Segel wrote:
>>
>>> How many times did you test it?
>>>
>>> Need to rule out aberrations.
>>>
>>> On Oct 29, 2012, at 11:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>
>>>> On your second low-memory NM instance, did you ensure to lower the
>>>> yarn.nodemanager.resource.memory-mb property specifically to avoid
>>>> swapping due to excessive resource grants? The default offered is 8 GB
>>>> (>> 1.7 GB you have).
>>>>
>>>> On Mon, Oct 29, 2012 at 8:42 PM, Alexandre Fouche <[EMAIL PROTECTED]> wrote:
>>>>> Hi,
>>>>>
>>>>> Can someone give some insight on why a "distcp" of 600 files of a few
>>>>> hundred bytes from s3n:// to local hdfs is taking 46s when using a
>>>>> yarn-nodemanager EC2 instance with 16GB memory (which by the way I
>>>>> think is jokingly long), and taking 3 min 30 s when adding a second
>>>>> yarn-nodemanager (a small instance with 1.7GB memory)?
>>>>> I would have expected it to be a bit faster, not 5x longer!
>>>>>
>>>>> I have the same issue when I stop the small instance nodemanager
>>>>> and restart it to join the processing after the big nodemanager
>>>>> instance has already been submitted the job.
>>>>>
>>>>> I am using Cloudera latest Yarn+HDFS on Amazon (rebranded CentOS 6)
>>>>>
>>>>> #Staging 14:58:04 root@datanode2:hadoop-yarn: rpm -qa |grep hadoop
>>>>> hadoop-hdfs-datanode-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-mapreduce-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-0.20-mapreduce-0.20.2+1261-1.cdh4.1.1.p0.4.el6.x86_64
>>>>> hadoop-yarn-nodemanager-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-mapreduce-historyserver-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-hdfs-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-client-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>> hadoop-yarn-2.0.0+545-1.cdh4.1.1.p0.5.el6.x86_64
>>>>>
>>>>>
>>>>> #Staging 14:39:51 root@resourcemanager:hadoop-yarn:
>>>>> HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce time hadoop distcp -overwrite s3n://xxx:[EMAIL PROTECTED]ev/* hdfs:///tmp/something/a
>>>>>
>>>>> 12/10/29 14:40:12 INFO tools.DistCp: Input Options:
>>>>> DistCpOptions{atomicCommit=false, syncFolder=false,
>>>>> deleteMissing=false, ignoreFailures=false, maxMaps=20,
>>>>> sslConfigurationFile='null', copyStrategy='uniformsize',
>>>>> sourceFileListing=null,
>>>>> sourcePaths=[s3n://xxx:[EMAIL PROTECTED]ev/*],
>>>>> targetPath=hdfs:/tmp/something/a}
>>>>> 12/10/29 14:40:18 WARN conf.Configuration: io.sort.mb is deprecated.
>>>>> Instead, use mapreduce.task.io.sort.mb

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz <http://about.me/marcosortiz>
@marcosluis2186 <http://twitter.com/marcosluis2186>
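
For reference, Harsh J's suggestion in the thread (lowering yarn.nodemanager.resource.memory-mb on the 1.7 GB NodeManager so it does not advertise the 8 GB default) might look roughly like the fragment below in that node's yarn-site.xml. This is only a sketch: the value 1024 is an assumption chosen for illustration, not a figure from the thread, and the right number depends on how much memory the DataNode and the OS need on that instance. Alexandre reports that changing this property did not help in his case, so it may not be the whole story.

```xml
<!-- yarn-site.xml on the small (1.7 GB) NodeManager only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- Illustrative assumption: must stay below the instance's physical
       RAM, leaving headroom for the DataNode and the OS -->
  <value>1024</value>
</property>
```

The NodeManager has to be restarted for the new value to be reported to the ResourceManager.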