|
George Stathis
2010-08-17, 18:03
George Stathis
2010-08-17, 22:11
Cosmin Lehene
2010-08-18, 07:26
Jean-Daniel Cryans
2010-08-17, 22:14
George P. Stathis
2010-08-17, 22:49
Ryan Rawson
2010-08-17, 22:54
George P. Stathis
2010-08-17, 23:03
Ryan Rawson
2010-08-17, 23:08
George P. Stathis
2010-08-17, 23:39
George P. Stathis
2010-08-18, 00:28
Ryan Rawson
2010-08-18, 00:59
George P. Stathis
2010-08-18, 02:04
|
-
High OS Load Numbers when idleGeorge Stathis 2010-08-17, 18:03
Hello,
We have just setup a new cluster on EC2 using Hadoop 0.20.2 and HBase 0.20.3. Our small setup as of right now consists of one master and four slaves with a replication factor of 2: Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its' own dedicated EMS drive) Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, 1 tasktracker, 1 regionserver We have also installed Ganglia to monitor the cluster stats as we are about to run some performance tests but, right out of the box, we are noticing high system loads (especially on the master node) without any activity happening on the clister. Of course, the CPUs are not being utilized at all, but Ganglia is reporting almost all nodes in the red as the 1, 5 an 15 minute load times are all above 100% most of the time (i.e. there are more than two processes at a time competing for the 2 CPUs time). Question1: is this normal? Question2: does it matter since each process barely uses any of the CPU time? Thank you in advance and pardon the noob questions. -GS +
George Stathis 2010-08-17, 18:03
-
Re: High OS Load Numbers when idleGeorge Stathis 2010-08-17, 22:11
No takers? :-) Am I missing something too obvious?
On Tue, Aug 17, 2010 at 2:03 PM, George Stathis <[EMAIL PROTECTED]> wrote: > Hello, > > We have just setup a new cluster on EC2 using Hadoop 0.20.2 and HBase > 0.20.3. Our small setup as of right now consists of one master and four > slaves with a replication factor of 2: > > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 > secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its' own > dedicated EMS drive) > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, > 1 tasktracker, 1 regionserver > > We have also installed Ganglia to monitor the cluster stats as we are about > to run some performance tests but, right out of the box, we are noticing > high system loads (especially on the master node) without any activity > happening on the clister. Of course, the CPUs are not being utilized at all, > but Ganglia is reporting almost all nodes in the red as the 1, 5 an 15 > minute load times are all above 100% most of the time (i.e. there are more > than two processes at a time competing for the 2 CPUs time). > > Question1: is this normal? > Question2: does it matter since each process barely uses any of the CPU > time? > > Thank you in advance and pardon the noob questions. > > -GS > +
George Stathis 2010-08-17, 22:11
-
Re: High OS Load Numbers when idleCosmin Lehene 2010-08-18, 07:26
George,
Did you try to monitor the CPU in real-time to see any patterns? Does it do any CPU spikes? Even so, the load averages should be matched against the number of cores/hyper threads. How many cores do you see in top? (I personally prefer htop to get a cluster overview in realtime). If you have 2 CPUs with 2 cores each the maximum load you should have would be 4. However if the cluster is idle it should go as low as 0.2 - 0.3. If you don't see the real-time utilization I'd check if Ganglia is set up correctly. Cosmin On Aug 18, 2010, at 1:11 AM, George Stathis wrote: > No takers? :-) Am I missing something too obvious? > > On Tue, Aug 17, 2010 at 2:03 PM, George Stathis <[EMAIL PROTECTED]> wrote: > >> Hello, >> >> We have just setup a new cluster on EC2 using Hadoop 0.20.2 and HBase >> 0.20.3. Our small setup as of right now consists of one master and four >> slaves with a replication factor of 2: >> >> Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 >> secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its' own >> dedicated EMS drive) >> Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, >> 1 tasktracker, 1 regionserver >> >> We have also installed Ganglia to monitor the cluster stats as we are about >> to run some performance tests but, right out of the box, we are noticing >> high system loads (especially on the master node) without any activity >> happening on the clister. Of course, the CPUs are not being utilized at all, >> but Ganglia is reporting almost all nodes in the red as the 1, 5 an 15 >> minute load times are all above 100% most of the time (i.e. there are more >> than two processes at a time competing for the 2 CPUs time). >> >> Question1: is this normal? >> Question2: does it matter since each process barely uses any of the CPU >> time? >> >> Thank you in advance and pardon the noob questions. >> >> -GS >> +
Cosmin Lehene 2010-08-18, 07:26
-
Re: High OS Load Numbers when idleJean-Daniel Cryans 2010-08-17, 22:14
It's not normal, but then again I don't have access to your machines
so I can only speculate. Does "top" show you which process is in %wa? If so and it's a java process, can you figure what's going on in there? J-D On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <[EMAIL PROTECTED]> wrote: > Hello, > > We have just setup a new cluster on EC2 using Hadoop 0.20.2 and HBase > 0.20.3. Our small setup as of right now consists of one master and four > slaves with a replication factor of 2: > > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 > secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its' own > dedicated EMS drive) > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, 1 > tasktracker, 1 regionserver > > We have also installed Ganglia to monitor the cluster stats as we are about > to run some performance tests but, right out of the box, we are noticing > high system loads (especially on the master node) without any activity > happening on the clister. Of course, the CPUs are not being utilized at all, > but Ganglia is reporting almost all nodes in the red as the 1, 5 an 15 > minute load times are all above 100% most of the time (i.e. there are more > than two processes at a time competing for the 2 CPUs time). > > Question1: is this normal? > Question2: does it matter since each process barely uses any of the CPU > time? > > Thank you in advance and pardon the noob questions. > > -GS > +
Jean-Daniel Cryans 2010-08-17, 22:14
-
Re: High OS Load Numbers when idleGeorge P. Stathis 2010-08-17, 22:49
Actually, there is nothing in %wa but a ton sitting in %id. This is
from the Master: top - 18:30:24 up 5 days, 20:10, 1 user, load average: 2.55, 1.99, 1.25 Tasks: 89 total, 1 running, 88 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.2%st Mem: 17920228k total, 2795464k used, 15124764k free, 248428k buffers Swap: 0k total, 0k used, 0k free, 1398388k cached I have atop installed which is reporting the hadoop/hbase java daemons as the most active processes (barely taking any CPU time though and most of the time in sleep mode): ATOP - domU-12-31-39-18-1 2010/08/17 18:31:46 10 seconds elapsed PRC | sys 0.01s | user 0.00s | #proc 89 | #zombie 0 | #exit 0 | CPU | sys 0% | user 0% | irq 0% | idle 200% | wait 0% | cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu000 w 0% | CPL | avg1 2.55 | avg5 2.12 | avg15 1.35 | csw 2397 | intr 2034 | MEM | tot 17.1G | free 14.4G | cache 1.3G | buff 242.6M | slab 193.1M | SWP | tot 0.0M | free 0.0M | | vmcom 1.6G | vmlim 8.5G | NET | transport | tcpi 330 | tcpo 169 | udpi 566 | udpo 147 | NET | network | ipi 896 | ipo 316 | ipfrw 0 | deliv 896 | NET | eth0 ---- | pcki 777 | pcko 197 | si 248 Kbps | so 70 Kbps | NET | lo ---- | pcki 119 | pcko 119 | si 9 Kbps | so 9 Kbps | PID CPU COMMAND-LINE 1/1 17613 0% atop 17150 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor 16527 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem 16839 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem 16735 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem 17083 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor Same with atop: PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 16527 ubuntu 20 0 2352M 98M 10336 S 0.0 0.6 0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h 16735 ubuntu 20 0 2403M 81544 10236 S 0.0 0.5 0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h 17083 ubuntu 20 0 4557M 45388 10912 S 0.0 0.3 0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap 1 root 20 0 23684 1880 1272 S 0.0 0.0 0:00.23 /sbin/init 587 root 20 0 247M 4088 2432 S 0.0 0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon 3336 root 20 0 49256 1092 540 S 0.0 0.0 0:00.36 /usr/sbin/sshd 16430 nobody 20 0 34408 3704 1060 S 0.0 0.0 0:00.01 gmond 17150 ubuntu 20 0 2519M 112M 11312 S 0.0 0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX So I'm a bit perplexed. Are there any hadoop / hbase specific tricks that can reveal what these processes are doing? -GS On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote: > > It's not normal, but then again I don't have access to your machines > so I can only speculate. > > Does "top" show you which process is in %wa? If so and it's a java > process, can you figure what's going on in there? > > J-D > > On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <[EMAIL PROTECTED]> wrote: > > Hello, > > > > We have just setup a new cluster on EC2 using Hadoop 0.20.2 and HBase > > 0.20.3. Our small setup as of right now consists of one master and four > > slaves with a replication factor of 2: > > > > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 +
George P. Stathis 2010-08-17, 22:49
-
Re: High OS Load Numbers when idleRyan Rawson 2010-08-17, 22:54
what does vmstat say? Run it like 'vmstat 2' for a minute or two and
paste the results. With no cpu being consumed by java, it seems like there must be another hidden variable here. Some zombied process perhaps. Or some kind of super IO wait or something else. Since you are running on a hypervisor environment, i cant really say what is happening to your instance, although one would think the LA numbers would be unaffected by outside processes. On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: > Actually, there is nothing in %wa but a ton sitting in %id. This is > from the Master: > > top - 18:30:24 up 5 days, 20:10, 1 user, load average: 2.55, 1.99, 1.25 > Tasks: 89 total, 1 running, 88 sleeping, 0 stopped, 0 zombie > Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.2%st > Mem: 17920228k total, 2795464k used, 15124764k free, 248428k buffers > Swap: 0k total, 0k used, 0k free, 1398388k cached > > I have atop installed which is reporting the hadoop/hbase java daemons > as the most active processes (barely taking any CPU time though and > most of the time in sleep mode): > > ATOP - domU-12-31-39-18-1 2010/08/17 18:31:46 10 seconds elapsed > PRC | sys 0.01s | user 0.00s | #proc 89 | #zombie 0 | #exit 0 | > CPU | sys 0% | user 0% | irq 0% | idle 200% | wait 0% | > cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu000 w 0% | > CPL | avg1 2.55 | avg5 2.12 | avg15 1.35 | csw 2397 | intr 2034 | > MEM | tot 17.1G | free 14.4G | cache 1.3G | buff 242.6M | slab 193.1M | > SWP | tot 0.0M | free 0.0M | | vmcom 1.6G | vmlim 8.5G | > NET | transport | tcpi 330 | tcpo 169 | udpi 566 | udpo 147 | > NET | network | ipi 896 | ipo 316 | ipfrw 0 | deliv 896 | > NET | eth0 ---- | pcki 777 | pcko 197 | si 248 Kbps | so 70 Kbps | > NET | lo ---- | pcki 119 | pcko 119 | si 9 Kbps | so 9 Kbps | > > PID CPU COMMAND-LINE 1/1 > 17613 0% atop > 17150 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor > 16527 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem > 16839 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem > 16735 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem > 17083 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor > > Same with atop: > > PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command > 16527 ubuntu 20 0 2352M 98M 10336 S 0.0 0.6 0:42.05 > /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server > -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote > -Dhadoop.log.dir=/var/log/h > 16735 ubuntu 20 0 2403M 81544 10236 S 0.0 0.5 0:01.56 > /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server > -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote > -Dhadoop.log.dir=/var/log/h > 17083 ubuntu 20 0 4557M 45388 10912 S 0.0 0.3 0:00.65 > /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m > -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC > -XX:+CMSIncrementalMode -server -XX:+Heap > 1 root 20 0 23684 1880 1272 S 0.0 0.0 0:00.23 /sbin/init > 587 root 20 0 247M 4088 2432 S 0.0 0.0 -596523h-14:-8 > /usr/sbin/console-kit-daemon --no-daemon > 3336 root 20 0 49256 1092 540 S 0.0 0.0 0:00.36 /usr/sbin/sshd > 16430 nobody 20 0 34408 3704 1060 S 0.0 0.0 0:00.01 gmond > 17150 ubuntu 20 0 2519M 112M 11312 S 0.0 0.6 -596523h-14:-8 > /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m > -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC > -XX:+CMSIncrementalMode -server -XX > > > So I'm a bit perplexed. Are there any hadoop / hbase specific tricks > that can reveal what these processes are doing? > > -GS > +
Ryan Rawson 2010-08-17, 22:54
-
Re: High OS Load Numbers when idleGeorge P. Stathis 2010-08-17, 23:03
vmstat 2 for 2 mins below. Looks like everything is in idle (github
gist paste if it's easier to read: http://gist.github.com/532512): procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 0 0 100 0 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 0 0 100 0 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 0 0 100 0 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 0 0 100 0 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 0 0 100 0 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 0 0 100 0 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 0 0 100 0 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 0 0 100 0 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 0 0 100 0 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 0 0 100 0 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 0 0 100 0 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 0 0 100 0 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 0 0 100 0 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 0 0 100 0 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 0 0 100 0 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 0 0 100 0 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 0 0 100 0 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 0 0 100 0 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 0 0 100 0 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 0 0 100 0 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 0 0 100 0 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 0 0 100 0 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 0 0 100 0 0 0 0 15095580 248428 1398448 0 0 0 8 180 210 0 0 100 0 5 0 0 15095580 248428 1398448 0 0 0 0 193 224 0 0 100 0 1 0 0 15095580 248428 1398448 0 0 0 36 161 191 0 0 100 0 0 0 0 15095428 248428 1398448 0 0 0 0 176 216 0 0 100 0 4 0 0 15095428 248428 1398448 0 0 0 0 202 236 0 0 100 0 0 0 0 15095428 248428 1398448 0 0 0 6 191 220 0 0 100 0 1 0 0 15095428 248428 1398448 0 0 0 0 188 238 0 0 100 0 2 0 0 15095276 248428 1398448 0 0 0 0 174 206 0 0 100 0 1 0 0 15095276 248428 1398448 0 0 0 0 225 249 0 0 100 0 0 0 0 15095124 248428 1398448 0 0 0 0 222 263 0 0 100 0 1 0 0 15095124 248428 1398448 0 0 0 6 187 236 0 0 100 0 5 0 0 15094940 248428 1398448 0 0 0 0 453 434 0 0 100 0 4 0 0 15094788 248428 1398448 0 0 0 2 227 225 0 0 100 0 0 0 0 15094788 248428 1398448 0 0 0 0 213 236 0 0 100 0 4 0 0 15094788 248428 1398448 0 0 0 6 257 253 0 0 100 0 0 0 0 15094636 248428 1398448 0 0 0 0 215 230 0 0 100 0 0 0 0 15094652 248428 1398448 0 0 0 0 259 285 0 0 100 0 0 0 0 15094500 248428 1398448 0 0 0 14 194 219 0 0 100 0 0 0 0 15094516 248428 1398448 0 0 0 0 227 257 0 0 100 0 4 0 0 15094516 248428 1398448 0 0 0 36 266 263 0 0 100 0 1 0 0 15094516 248428 1398448 0 0 0 0 202 213 0 0 100 0 0 0 0 15094364 248428 1398448 0 0 0 0 204 240 0 0 100 0 0 0 0 15094212 248428 1398448 0 0 0 6 161 194 0 0 100 0 0 0 0 15094212 248428 1398448 0 0 0 0 191 215 0 0 100 0 1 0 0 15094212 248428 1398448 0 0 0 0 216 238 0 0 100 0 5 0 0 15094212 248428 1398448 0 0 0 6 169 202 0 0 100 0 0 0 0 15094060 248428 1398448 0 0 0 0 172 216 0 0 100 0 2 0 0 15094060 248428 1398448 0 0 0 6 201 196 0 0 100 0 1 0 0 15094060 248428 1398448 0 0 0 0 196 218 0 0 100 0 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 15093908 248428 1398448 0 0 0 0 206 236 0 0 100 0 4 0 0 15093908 248428 1398448 0 0 0 0 197 219 0 0 100 0 0 0 0 15093908 248428 1398448 0 0 0 0 186 227 0 0 100 0 0 0 0 15093756 248428 1398448 0 0 0 0 168 182 0 0 100 0 0 0 0 15093756 248428 1398448 0 0 0 0 206 239 0 0 100 0 0 0 0 15093604 248428 1398448 0 0 0 6 281 248 0 0 100 0 0 0 0 15093452 248428 1398448 0 0 0 0 185 198 0 0 100 0 5 0 0 15093452 248428 1398448 0 0 0 0 265 253 0 0 100 0 0 0 0 15093300 248428 1398448 0 0 0 36 194 211 0 0 100 0 0 0 0 15093300 248428 1398448 0 0 0 0 228 242 0 0 100 0 0 0 0 15093300 248428 1398448 0 0 0 0 290 262 0 0 100 0 0 0 0 15093300 248428 1398448 0 0 0 6 187 207 0 0 100 0 On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: +
George P. Stathis 2010-08-17, 23:03
-
Re: High OS Load Numbers when idleRyan Rawson 2010-08-17, 23:08
This is odd, the run queue has quite a bit of processes in it, but the
actual cpu data says 'nothings happening'. Can't tell which processes are in the run queue here, does the load average go down when you take hadoop/hbase down? If so, that seems to identify the locus of the problem, but doesn't really explain WHY this is so. Consider I've never really seen this before I'd just write it off to using a hypervisor'ed environment and blame other people stealing all your cpus :-) On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: > vmstat 2 for 2 mins below. Looks like everything is in idle (github > gist paste if it's easier to read: http://gist.github.com/532512): > > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 > 0 0 100 0 > 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 > 0 0 100 0 > 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 > 0 0 100 0 > 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 > 0 0 100 0 > 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 > 0 0 100 0 > 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 > 0 0 100 0 > 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 > 0 0 100 0 > 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 > 0 0 100 0 > 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 > 0 0 100 0 > 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 > 0 0 100 0 > 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 > 0 0 100 0 > 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 > 0 0 100 0 > 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 > 0 0 100 0 > 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 > 0 0 100 0 > 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 > 0 0 100 0 > 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 > 0 0 100 0 > 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 > 0 0 100 0 > 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 > 0 0 100 0 > 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 > 0 0 100 0 > 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 > 0 0 100 0 > 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 > 0 0 100 0 > 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 > 0 0 100 0 > 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 > 0 0 100 0 > 0 0 0 15095580 248428 1398448 0 0 0 8 180 210 > 0 0 100 0 > 5 0 0 15095580 248428 1398448 0 0 0 0 193 224 > 0 0 100 0 > 1 0 0 15095580 248428 1398448 0 0 0 36 161 191 > 0 0 100 0 > 0 0 0 15095428 248428 1398448 0 0 0 0 176 216 > 0 0 100 0 > 4 0 0 15095428 248428 1398448 0 0 0 0 202 236 > 0 0 100 0 > 0 0 0 15095428 248428 1398448 0 0 0 6 191 220 > 0 0 100 0 > 1 0 0 15095428 248428 1398448 0 0 0 0 188 238 > 0 0 100 0 > 2 0 0 15095276 248428 1398448 0 0 0 0 174 206 > 0 0 100 0 > 1 0 0 15095276 248428 1398448 0 0 0 0 225 249 > 0 0 100 0 > 0 0 0 15095124 248428 1398448 0 0 0 0 222 263 > 0 0 100 0 > 1 0 0 15095124 248428 1398448 0 0 0 6 187 236 > 0 0 100 0 > 5 0 0 15094940 248428 1398448 0 0 0 0 453 434 > 0 0 100 0 > 4 0 0 15094788 248428 1398448 0 0 0 2 227 225 > 0 0 100 0 > 0 0 0 15094788 248428 1398448 0 0 0 0 213 236 +
Ryan Rawson 2010-08-17, 23:08
-
Re: High OS Load Numbers when idleGeorge P. Stathis 2010-08-17, 23:39
Yeap. Load average definitely goes back down to 0.10 or below once
hadoop/hbase is shut off. Same issue is happening on the slaves as well BTW. We are running on Ubuntu 10.04 which is the only recent change other than moving to a cluster setup. We were running a pseudo-distributed setup before on one machine running Ubuntu 9.10. Top shows just as much activity under %id but load average was close to zero when idle! Environment is hypervisor'ed as well, so I suspect that's not really the issue...:-) Know anyone running on an Ubuntu 10.04 setup? -GS On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: > This is odd, the run queue has quite a bit of processes in it, but the > actual cpu data says 'nothings happening'. > > Can't tell which processes are in the run queue here, does the load > average go down when you take hadoop/hbase down? > > If so, that seems to identify the locus of the problem, but doesn't > really explain WHY this is so. Consider I've never really seen this > before I'd just write it off to using a hypervisor'ed environment and > blame other people stealing all your cpus :-) > > On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >> vmstat 2 for 2 mins below. Looks like everything is in idle (github >> gist paste if it's easier to read: http://gist.github.com/532512): >> >> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- >> r b swpd free buff cache si so bi bo in cs us sy id wa >> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 >> 0 0 100 0 >> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 >> 0 0 100 0 >> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 >> 0 0 100 0 >> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 >> 0 0 100 0 >> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 >> 0 0 100 0 >> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 >> 0 0 100 0 >> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 >> 0 0 100 0 >> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 >> 0 0 100 0 >> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 >> 0 0 100 0 >> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 >> 0 0 100 0 >> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 >> 0 0 100 0 >> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 >> 0 0 100 0 >> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 >> 0 0 100 0 >> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 >> 0 0 100 0 >> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 >> 0 0 100 0 >> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 >> 0 0 100 0 >> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 >> 0 0 100 0 >> 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 >> 0 0 100 0 >> 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 >> 0 0 100 0 >> 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 >> 0 0 100 0 >> 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 >> 0 0 100 0 >> 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 >> 0 0 100 0 >> 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 >> 0 0 100 0 >> 0 0 0 15095580 248428 1398448 0 0 0 8 180 210 >> 0 0 100 0 >> 5 0 0 15095580 248428 1398448 0 0 0 0 193 224 >> 0 0 100 0 >> 1 0 0 15095580 248428 1398448 0 0 0 36 161 191 >> 0 0 100 0 >> 0 0 0 15095428 248428 1398448 0 0 0 0 176 216 >> 0 0 100 0 >> 4 0 0 15095428 248428 1398448 0 0 0 0 202 236 >> 0 0 100 0 >> 0 0 0 15095428 248428 1398448 0 0 0 6 191 220 +
George P. Stathis 2010-08-17, 23:39
-
Re: High OS Load Numbers when idleGeorge P. Stathis 2010-08-18, 00:28
OK, high load averages while idle is a confirmed Lucid issue:
https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 Will downgrade back to 9.10 and try again. Thanks for helping out Ryan. If you hadn't confirmed that this is not normal, I would have wasted a lot more time looking in the wrong places. Now I just have to rebuild us a new cluster... -GS On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: > Yeap. Load average definitely goes back down to 0.10 or below once > hadoop/hbase is shut off. Same issue is happening on the slaves as > well BTW. We are running on Ubuntu 10.04 which is the only recent > change other than moving to a cluster setup. We were running a > pseudo-distributed setup before on one machine running Ubuntu 9.10. > Top shows just as much activity under %id but load average was close > to zero when idle! Environment is hypervisor'ed as well, so I suspect > that's not really the issue...:-) Know anyone running on an Ubuntu > 10.04 setup? > > -GS > > On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: >> This is odd, the run queue has quite a bit of processes in it, but the >> actual cpu data says 'nothings happening'. >> >> Can't tell which processes are in the run queue here, does the load >> average go down when you take hadoop/hbase down? >> >> If so, that seems to identify the locus of the problem, but doesn't >> really explain WHY this is so. Consider I've never really seen this >> before I'd just write it off to using a hypervisor'ed environment and >> blame other people stealing all your cpus :-) >> >> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >>> vmstat 2 for 2 mins below. Looks like everything is in idle (github >>> gist paste if it's easier to read: http://gist.github.com/532512): >>> >>> >>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- >>> r b swpd free buff cache si so bi bo in cs us sy id wa >>> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 >>> 0 0 100 0 >>> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 >>> 0 0 100 0 >>> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 >>> 0 0 100 0 >>> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 >>> 0 0 100 0 >>> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 >>> 0 0 100 0 >>> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 >>> 0 0 100 0 >>> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 >>> 0 0 100 0 >>> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 >>> 0 0 100 0 >>> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 >>> 0 0 100 0 >>> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 >>> 0 0 100 0 >>> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 >>> 0 0 100 0 >>> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 >>> 0 0 100 0 >>> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 >>> 0 0 100 0 >>> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 >>> 0 0 100 0 >>> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 >>> 0 0 100 0 >>> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 >>> 0 0 100 0 >>> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 >>> 0 0 100 0 >>> 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 >>> 0 0 100 0 >>> 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 >>> 0 0 100 0 >>> 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 >>> 0 0 100 0 >>> 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 >>> 0 0 100 0 >>> 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 >>> 0 0 100 0 >>> 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 +
George P. Stathis 2010-08-18, 00:28
-
Re: High OS Load Numbers when idleRyan Rawson 2010-08-18, 00:59
You're welcome. Glad you figured it out! Very strange, and it
totally makes sense this is a kernel/os bug. -ryan On Tue, Aug 17, 2010 at 5:28 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: > OK, high load averages while idle is a confirmed Lucid issue: > https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 > > Will downgrade back to 9.10 and try again. Thanks for helping out > Ryan. If you hadn't confirmed that this is not normal, I would have > wasted a lot more time looking in the wrong places. Now I just have to > rebuild us a new cluster... > > -GS > > On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >> Yeap. Load average definitely goes back down to 0.10 or below once >> hadoop/hbase is shut off. Same issue is happening on the slaves as >> well BTW. We are running on Ubuntu 10.04 which is the only recent >> change other than moving to a cluster setup. We were running a >> pseudo-distributed setup before on one machine running Ubuntu 9.10. >> Top shows just as much activity under %id but load average was close >> to zero when idle! Environment is hypervisor'ed as well, so I suspect >> that's not really the issue...:-) Know anyone running on an Ubuntu >> 10.04 setup? >> >> -GS >> >> On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: >>> This is odd, the run queue has quite a bit of processes in it, but the >>> actual cpu data says 'nothings happening'. >>> >>> Can't tell which processes are in the run queue here, does the load >>> average go down when you take hadoop/hbase down? >>> >>> If so, that seems to identify the locus of the problem, but doesn't >>> really explain WHY this is so. Consider I've never really seen this >>> before I'd just write it off to using a hypervisor'ed environment and >>> blame other people stealing all your cpus :-) >>> >>> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >>>> vmstat 2 for 2 mins below. Looks like everything is in idle (github >>>> gist paste if it's easier to read: http://gist.github.com/532512): >>>> >>>> >>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- >>>> r b swpd free buff cache si so bi bo in cs us sy id wa >>>> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 >>>> 0 0 100 0 >>>> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 >>>> 0 0 100 0 >>>> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 >>>> 0 0 100 0 >>>> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 >>>> 0 0 100 0 >>>> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 >>>> 0 0 100 0 >>>> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 >>>> 0 0 100 0 >>>> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 >>>> 0 0 100 0 >>>> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 >>>> 0 0 100 0 >>>> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 >>>> 0 0 100 0 >>>> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 >>>> 0 0 100 0 >>>> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 >>>> 0 0 100 0 >>>> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 >>>> 0 0 100 0 >>>> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 >>>> 0 0 100 0 >>>> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 >>>> 0 0 100 0 >>>> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 >>>> 0 0 100 0 >>>> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 >>>> 0 0 100 0 >>>> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 >>>> 0 0 100 0 >>>> 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 >>>> 0 0 100 0 >>>> 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 >>>> 0 0 100 0 >>>> 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 +
Ryan Rawson 2010-08-18, 00:59
-
Re: High OS Load Numbers when idleGeorge P. Stathis 2010-08-18, 02:04
Thank you to JD as well for first response!
On Tue, Aug 17, 2010 at 8:59 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: > You're welcome. Glad you figured it out! Very strange, and it > totally makes sense this is a kernel/os bug. > > -ryan > > On Tue, Aug 17, 2010 at 5:28 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >> OK, high load averages while idle is a confirmed Lucid issue: >> https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 >> >> Will downgrade back to 9.10 and try again. Thanks for helping out >> Ryan. If you hadn't confirmed that this is not normal, I would have >> wasted a lot more time looking in the wrong places. Now I just have to >> rebuild us a new cluster... >> >> -GS >> >> On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >>> Yeap. Load average definitely goes back down to 0.10 or below once >>> hadoop/hbase is shut off. Same issue is happening on the slaves as >>> well BTW. We are running on Ubuntu 10.04 which is the only recent >>> change other than moving to a cluster setup. We were running a >>> pseudo-distributed setup before on one machine running Ubuntu 9.10. >>> Top shows just as much activity under %id but load average was close >>> to zero when idle! Environment is hypervisor'ed as well, so I suspect >>> that's not really the issue...:-) Know anyone running on an Ubuntu >>> 10.04 setup? >>> >>> -GS >>> >>> On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: >>>> This is odd, the run queue has quite a bit of processes in it, but the >>>> actual cpu data says 'nothings happening'. >>>> >>>> Can't tell which processes are in the run queue here, does the load >>>> average go down when you take hadoop/hbase down? >>>> >>>> If so, that seems to identify the locus of the problem, but doesn't >>>> really explain WHY this is so. Consider I've never really seen this >>>> before I'd just write it off to using a hypervisor'ed environment and >>>> blame other people stealing all your cpus :-) >>>> >>>> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[EMAIL PROTECTED]> wrote: >>>>> vmstat 2 for 2 mins below. Looks like everything is in idle (github >>>>> gist paste if it's easier to read: http://gist.github.com/532512): >>>>> >>>>> >>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- >>>>> r b swpd free buff cache si so bi bo in cs us sy id wa >>>>> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 >>>>> 0 0 100 0 >>>>> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 >>>>> 0 0 100 0 >>>>> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 >>>>> 0 0 100 0 >>>>> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 >>>>> 0 0 100 0 >>>>> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 >>>>> 0 0 100 0 >>>>> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 >>>>> 0 0 100 0 >>>>> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 >>>>> 0 0 100 0 >>>>> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 >>>>> 0 0 100 0 >>>>> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 >>>>> 0 0 100 0 >>>>> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 >>>>> 0 0 100 0 >>>>> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 >>>>> 0 0 100 0 >>>>> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 >>>>> 0 0 100 0 >>>>> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 >>>>> 0 0 100 0 >>>>> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 >>>>> 0 0 100 0 >>>>> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 >>>>> 0 0 100 0 >>>>> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 >>>>> 0 0 100 0 >>>>> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 >>>>> 0 0 100 0 +
George P. Stathis 2010-08-18, 02:04
|