-
HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-03, 22:34
So here's the run after I resolved all the set up issues:
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/14/testReport/

Here's what I see timing out:
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/14/testReport/org.apache.bigtop.itest.hbase.smoke/TestHBaseCompression/testNoCompression/
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/14/testReport/org.apache.bigtop.itest.hbase.smoke/TestHBaseCompression/testGzipCompression/

Which is basically simply:
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://ip-10-32-33-167.ec2.internal:17020/user/root/snappy-output/testfile.gz gz
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://ip-10-32-33-167.ec2.internal:17020/user/root/snappy-output/testfile.none none

And yes, they are hanging when executed by hand as well. Which is weird too, since the test itself actually completes, and exits org.apache.hadoop.hbase.util.CompressionTest.main and then everything freezes over with the following stack trace:

$ jstack 8412
2011-11-03 18:33:37
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b16 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f4e6c001000 nid=0x213a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007f4edc009800 nid=0x2108 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"LeaseChecker" daemon prio=10 tid=0x00007f4edc7e6800 nid=0x211c waiting on condition [0x00007f4e62fe0000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1476)
        at java.lang.Thread.run(Thread.java:619)

"LRU Statistics #0" prio=10 tid=0x00007f4edc7dd800 nid=0x211a waiting on condition [0x00007f4e631e2000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f4e88acc968> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:583)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:576)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)

"LruBlockCache.EvictionThread" daemon prio=10 tid=0x00007f4edc7d8800 nid=0x2119 in Object.wait() [0x00007f4e632e3000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f4e88b1e570> (a org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread.run(LruBlockCache.java:568)
        - locked <0x00007f4e88b1e570> (a org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
        at java.lang.Thread.run(Thread.java:619)

"Low Memory Detector" daemon prio=10 tid=0x00007f4edc0ff000 nid=0x2113 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00007f4edc0fd000 nid=0x2112 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00007f4edc0fa800 nid=0x2111 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f4edc0f8800 nid=0x2110 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x00007f4edc0f6800 nid=0x210f waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f4edc0d8800 nid=0x210e in Object.wait() [0x00007f4ed031a000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f4e896b0620> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x00007f4e896b0620> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00007f4edc0d6800 nid=0x210d in Object.wait() [0x00007f4ed041b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f4e896b0700> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00007f4e896b0700> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007f4edc0d2000 nid=0x210c runnable

"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f4edc018000 nid=0x2109 runnable

"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f4edc019800 nid=0x210a runnable

"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x00007f4edc078000 nid=0x210b runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f4edc10a000 nid=0x2114 waiting on condition

JNI global references: 1268

Does this ring a bell, is this a problem we should pursue?

Thanks,
Roman.
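
When a JVM stays up after main has returned, the threads to look at in a dump like this are the ones not marked "daemon", because the JVM exits only once every non-daemon thread has finished. The same check can be made from inside the process. The following is a minimal sketch; the class and method names are hypothetical and are not part of HBase or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

public class NonDaemonThreads {
    /** Returns the names of live, non-daemon threads other than the calling thread. */
    public static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Called at the end of main, this lists what would keep the JVM alive.
        for (String name : liveNonDaemonThreadNames()) {
            System.out.println("non-daemon thread still running: " + name);
        }
    }
}
```

Calling such a helper as the last statement of a command-line tool's main is a cheap way to spot a leftover worker thread without attaching jstack.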
-
Re: HBase 0.92/Hadoop 0.22 test results
Stack 2011-11-03, 22:48
On Thu, Nov 3, 2011 at 3:34 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> So here's the run after I resolved all the set up issues:
> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/14/testReport/

I see this too:

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName

Is that not in hadoop 0.22?

St.Ack
P.S. Thanks for doing this.
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-03, 22:52
Copying Konstantin.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-03, 23:01
On Thu, Nov 3, 2011 at 3:48 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 3, 2011 at 3:34 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
>> So here's the run after I resolved all the set up issues:
>> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/14/testReport/
>
> I see this too:
>
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName

That's an unfortunate red herring that should be fixed now that you've committed HBASE-4719 (thanks, btw!). I don't think it affects the test execution in any way.

In fact, let me rebuild Bigtop .22/.92 with the latest trunk just to cut down on confusion.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Andrew Purtell 2011-11-03, 23:15
""LRU Statistics #0" prio=10 tid=0x00007f4edc7dd800 nid=0x211a waiting on condition [0x00007f4e631e2000]" is not a daemon thread, but should be?
   - Andy
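
Some background on the daemon-thread question: an executor created with the JDK's default thread factory gets non-daemon worker threads, and a single idle worker parked in a ScheduledThreadPoolExecutor is enough to keep the JVM alive after main returns. One common remedy is to give the executor a ThreadFactory that marks its threads as daemons. The sketch below illustrates that idea only. The class name is hypothetical, the "LRU Statistics" prefix is borrowed from the dump for illustration, and the thread does not show the actual change made in HBase.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DaemonSchedulerSketch {
    /** Hypothetical factory: names threads "<prefix> #<n>" and marks them as daemons. */
    public static ThreadFactory daemonFactory(final String prefix) {
        final AtomicInteger counter = new AtomicInteger();
        return new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, prefix + " #" + counter.getAndIncrement());
                t.setDaemon(true); // daemon threads do not block JVM exit
                return t;
            }
        };
    }

    public static void main(String[] args) {
        ScheduledExecutorService pool =
            Executors.newScheduledThreadPool(1, daemonFactory("LRU Statistics"));
        pool.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                System.out.println("periodic stats tick");
            }
        }, 0, 1, TimeUnit.SECONDS);
        // main returns here. Because the worker is a daemon thread, the JVM
        // can exit without an explicit pool.shutdown().
    }
}
```

The alternative is to keep the worker non-daemon and call shutdown() on the executor when the owning component closes; the factory approach is the smaller change when the work is purely periodic bookkeeping.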
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-03, 23:37
HBASE-4745 has been logged.
-
RE: HBase 0.92/Hadoop 0.22 test results
Shvachko, Konstantin 2011-11-04, 00:35
org.apache.hadoop.util.PlatformName is there in common.
Is there a problem with jars that are used in HBase or a problem with jar generation?
I did

jar -tvf hadoop-common-0.22.0-SNAPSHOT.jar | grep PlatformName
org/apache/hadoop/util/PlatformName.class

So it should be in the build.

--Konstantin
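
The jar listing above can also be done programmatically, which is handy when the check has to run inside a test harness. The following is a minimal sketch using the JDK's java.util.jar API; the class and method names are hypothetical. Note that it only shows the class is packaged in the jar. A ClassNotFoundException can still occur if that jar is not on the runtime classpath of the process that needs it.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {
    /** True if the jar at jarPath has an entry for the given fully qualified class name. */
    public static boolean jarContainsClass(String jarPath, String className) throws IOException {
        // org.apache.hadoop.util.PlatformName -> org/apache/hadoop/util/PlatformName.class
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Example arguments:
        //   hadoop-common-0.22.0-SNAPSHOT.jar org.apache.hadoop.util.PlatformName
        System.out.println(jarContainsClass(args[0], args[1]));
    }
}
```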
-
Re: HBase 0.92/Hadoop 0.22 test results
Stack 2011-11-05, 04:02
On Thu, Nov 3, 2011 at 4:37 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> HBASE-4745 has been logged.
>
> On Thu, Nov 3, 2011 at 4:15 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
>
>> ""LRU Statistics #0" prio=10 tid=0x00007f4edc7dd800 nid=0x211a waiting on
>> condition [0x00007f4e631e2000]" is not a daemon thread, but should be?

FYI, the boys fixed hbase-4745 in TRUNK/0.92 branch.

St.Ack
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-05, 23:28
On Fri, Nov 4, 2011 at 9:02 PM, Stack <[EMAIL PROTECTED]> wrote:
> FYI, the boys fixed hbase-4745 in TRUNK/0.92 branch.

Great! I've deployed a cluster from the 0.92 head and all of a sudden started to see the following issue: any attempt at table creation generates the following in the logs:

11/11/05 19:08:48 INFO handler.CreateTableHandler: Attemping to create the table b
11/11/05 19:08:48 ERROR handler.CreateTableHandler: Error trying to create the table b
java.io.FileNotFoundException: File hdfs://ip-10-110-254-200.ec2.internal:17020/hbase/b does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:387)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1085)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1110)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:257)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:243)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.createTableDescriptor(FSTableDescriptors.java:566)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.createTableDescriptor(FSTableDescriptors.java:535)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.createTableDescriptor(FSTableDescriptors.java:519)
        at org.apache.hadoop.hbase.master.handler.CreateTableHandler.handleCreateTable(CreateTableHandler.java:140)
        at org.apache.hadoop.hbase.master.handler.CreateTableHandler.process(CreateTableHandler.java:126)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

Here's ZK perspective on that same table:
[zk: localhost(CONNECTED) 4] get /hbase/table/b
�#[EMAIL PROTECTED]rnalENABLING

In general, master has no troubles writing to HDFS. I see /hbase/-ROOT-/ and /hbase/.META. and the usual stuff. On top of that it doesn't seem to be HDFS specific at all. Running HBASE in a standalone mode produces the following:

11/11/05 19:27:09 INFO handler.CreateTableHandler: Attemping to create the table h
11/11/05 19:27:09 ERROR handler.CreateTableHandler: Error trying to create the table h
java.io.FileNotFoundException: File file:/tmp/hbase-hbase/hbase/h does not exist.

Any ideas on what could be going wrong?

Thanks,
Roman.
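
In the trace above, a directory listing is attempted on a table directory that has not been created yet, and the listing call throws instead of returning nothing. One general way to make such a lookup tolerant is to treat a missing directory as an empty one. The sketch below shows that pattern with the JDK's java.nio.file API as an analogy; it deliberately does not use Hadoop's FileSystem API, the class and method names are hypothetical, and nothing in this thread says the eventual HBase fix took this form.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SafeListing {
    /** Lists the children of dir, or returns an empty list if dir does not exist. */
    public static List<Path> listOrEmpty(Path dir) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path child : stream) {
                children.add(child);
            }
        } catch (NoSuchFileException e) {
            // The directory has not been created yet: report "no entries"
            // and let the caller go on to create it.
            return Collections.emptyList();
        }
        return children;
    }
}
```

Other I/O errors still propagate as IOException, so a genuine permissions or connectivity problem is not silently hidden.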
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-06, 00:36
Yesterday at around 3pm I deployed TRUNK to a five node test cluster.
I verified that I could create a table.
Among the JIRAs integrated after that, HBASE-4553 <https://issues.apache.org/jira/browse/HBASE-4553> is a possible cause of this regression.

Cheers
-
Re: HBase 0.92/Hadoop 0.22 test results
Stack 2011-11-06, 22:33
On Sat, Nov 5, 2011 at 4:28 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 4, 2011 at 9:02 PM, Stack <[EMAIL PROTECTED]> wrote:
>> FYI, the boys fixed hbase-4745 in TRUNK/0.92 branch.
>
> Great! I've deployed a cluster from the 0.92 head and all of a sudden
> started to see the following issue: any attempt at table creation
> generates the following in the logs:
>

Odd. I just tried tip of 0.92 as of now and both in local mode and up
on an hdfs cluster I can create tables fine. Send over more log
Roman. Anything in namenode logs about perms or creating files in
dirs that have not yet been created?

St.Ack
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-07, 00:12
On Sun, Nov 6, 2011 at 2:33 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Sat, Nov 5, 2011 at 4:28 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
>> On Fri, Nov 4, 2011 at 9:02 PM, Stack <[EMAIL PROTECTED]> wrote:
>>> FYI, the boys fixed hbase-4745 in TRUNK/0.92 branch.
>>
>> Great! I've deployed a cluster from the 0.92 head and all of a sudden
>> started to see the following issue: any attempt at table creation
>> generates the following in the logs:
>>
>
> Odd. I just tried tip of 0.92 as of now and both in local mode and up
> on an hdfs cluster I can create tables fine. Send over more log
> Roman. Anything in namenode logs about perms or creating files in
> dirs that have not yet been created?

Odd indeed. Whatever it was is now gone when I build from this SHA:
    61b5659bf7971cfac32f3cf4fca0d3823b4c8f8c
However, I can still reproduce it when I build from the previous SHA:
    454a75d2eb122b198140a778d00d6e1bc086517e

I think since it got fixed, it is probably not really worth pursuing.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-07, 02:38
On Sun, Nov 6, 2011 at 4:12 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> Odd indeed. Whatever it was is now gone when I build from this SHA:
>     61b5659bf7971cfac32f3cf4fca0d3823b4c8f8c
> However, I can still reproduce it when I build from the previous SHA:
>     454a75d2eb122b198140a778d00d6e1bc086517e
>
> I think since it got fixed, it is probably not really worth pursuing.

Here's the final deal -- this is Hadoop 0.22 related. I can reliably
reproduce it if I enable the .22 profile. Here's how:

$ git pull ; git checkout remotes/origin/0.92
$ mvn clean assembly:assembly -DskipTests -Dhadoop.profile=22
$ tar xzvf target/hbase-0.92.0-SNAPSHOT.tar.gz -C /tmp/22
$ rm -rf /tmp/hbase*
$ /tmp/22/hbase-0.92.0-SNAPSHOT/hbase-daemon.sh start master
$ /tmp/22/hbase-0.92.0-SNAPSHOT/hbase shell
11/11/06 18:13:50 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
11/11/06 18:13:50 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
11/11/06 18:13:50 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
11/11/06 18:13:50 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.0-SNAPSHOT, r61b5659bf7971cfac32f3cf4fca0d3823b4c8f8c, Sun Nov 6 18:02:26 PST 2011

hbase(main):001:0> create 't', 'f'

And it hangs.

I'm about to attend a social function in the next couple of hours and
will probably dig further tomorrow at ApacheCON.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-07, 05:00
I dug through the code a little bit. Indeed the following exception was due
to the difference in DistributedFileSystem.listStatus() between
0.20.205 and 0.22:

11/11/05 19:08:48 ERROR handler.CreateTableHandler: Error trying to create the table b
java.io.FileNotFoundException: File hdfs://ip-10-110-254-200.ec2.internal:17020/hbase/b does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:387)

In 0.20.205:

  public FileStatus[] listStatus(Path p) throws IOException {
    String src = getPathName(p);

    // fetch the first batch of entries in the directory
    DirectoryListing thisListing = dfs.listPaths(
        src, HdfsFileStatus.EMPTY_NAME);

    if (thisListing == null) { // the directory does not exist
      return null;
    }

In 0.22:

  @Override
  public FileStatus[] listStatus(Path p) throws IOException {
    String src = getPathName(p);

    // fetch the first batch of entries in the directory
    DirectoryListing thisListing = dfs.listPaths(
        src, HdfsFileStatus.EMPTY_NAME);

    if (thisListing == null) { // the directory does not exist
      throw new FileNotFoundException("File " + p + " does not exist.");
    }

So in FSTableDescriptors.getTableInfoPath(), we should catch
FileNotFoundException and treat it the same way as status being null.

Cheers

On Sun, Nov 6, 2011 at 6:38 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:

> On Sun, Nov 6, 2011 at 4:12 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> > Odd indeed. Whatever it was is now gone when I build from this SHA:
> >     61b5659bf7971cfac32f3cf4fca0d3823b4c8f8c
> > However, I can still reproduce it when I build from the previous SHA:
> >     454a75d2eb122b198140a778d00d6e1bc086517e
> >
> > I think since it got fixed, it is probably not really worth pursuing.
>
> Here's the final deal -- this is Hadoop 0.22 related. I can reliably
> reproduce it if I enable the .22 profile.
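The fix suggested in this message (treat a FileNotFoundException from listStatus() the same as a null listing) could look roughly like the sketch below. DirLister and ListStatusCompat are hypothetical stand-ins invented here so that the example runs without Hadoop on the classpath; they are not the real FileSystem or FSTableDescriptors code, and the actual HBASE-4754 patch may well differ.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical stand-in for the one FileSystem call discussed in the thread.
// It is NOT the Hadoop API; it only models "list a directory".
interface DirLister {
    String[] listStatus(String path) throws IOException;
}

class ListStatusCompat {
    // Returns null when the directory is missing, whether the lister reports
    // that by returning null (0.20.205 behaviour) or by throwing
    // FileNotFoundException (0.22 behaviour).
    static String[] listStatusOrNull(DirLister fs, String path) throws IOException {
        try {
            return fs.listStatus(path);
        } catch (FileNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        DirLister oldStyle = path -> null;
        DirLister newStyle = path -> {
            throw new FileNotFoundException("File " + path + " does not exist.");
        };
        // Both calls report a missing directory the same way.
        System.out.println(listStatusOrNull(oldStyle, "/hbase/b") == null);
        System.out.println(listStatusOrNull(newStyle, "/hbase/b") == null);
    }
}
```

Callers can then keep a single "listing is null" branch regardless of which Hadoop profile the build uses.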
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-07, 05:54
HBASE-4754 has been filed.
FYI
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-08, 06:16
With HBASE-4754 fix in place I can get further in my testing,
but it still fails :-(

Here's how it does it this time. It loads OK, but then when it
needs to split here's what happens:

11/11/08 00:44:30 INFO handler.ServerShutdownHandler: Splitting logs for ip-10-114-225-185.ec2.internal,60020,1320726988138
11/11/08 00:44:30 INFO master.SplitLogManager: dead splitlog worker ip-10-114-225-185.ec2.internal,60020,1320726988138
11/11/08 00:44:30 INFO master.SplitLogManager: started splitting logs in [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]
11/11/08 00:44:31 ERROR master.HMaster: Region server ^@^@ip-10-114-225-185.ec2.internal,60020,1320726988138 reported a fatal error:
ABORTING region server ip-10-114-225-185.ec2.internal,60020,1320726988138: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-10-114-225-185.ec2.internal,60020,1320726988138 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:222)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:148)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:750)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1306)

That's on the master side, on the regionserver side, it looks really
weird. It basically hums along doing the split and then at some point,
there's this:

11/11/08 00:43:40 INFO regionserver.Store: Added hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/TestLoadAndVerify_1320729464658/8bd8387431feec2b09983693dfac950b/f1/4fc67a93e580402190b5c8a72820f665, entries=82049, sequenceid=142942, memsize=18.1m, filesize=4.4m
11/11/08 00:43:40 INFO regionserver.HRegion: Finished memstore flush of ~18.4m for region TestLoadAndVerify_1320729464658,<\xA1\xAF(k\xCA\x1A\xEA,1320729465485.8bd8387431feec2b09983693dfac950b. in 829ms, sequenceid=142942, compaction requested=false
11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x133817270190001, likely server has closed socket, closing socket connection and attempting reconnect
11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x133817270190004, likely server has closed socket, closing socket connection and attempting reconnect
11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
11/11/08 00:44:31 FATAL regionserver.HRegionServer: ABORTING region server ip-10-114-225-185.ec2.internal,60020,1320726988138: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-10-114-225-185.ec2.internal,60020,1320726988138 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:222)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:148)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:750)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1306)

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-08, 06:37
Forgot to add that from a master UI perspective here's where it is
stuck at:

$ curl http://master:60010/master-status?format=json
[{"statustimems":-1,"status":"Waiting for distributed tasks to finish. scheduled=5 done=0 error=0","starttimems":1320731070095,"description":"Doing distributed log split in [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]","state":"RUNNING","statetimems":-1}]

Regionserver finally dies and if I restart it manually the split seems to be
finishing up as intended.

Hope this helps.

Thanks,
Roman.

On Mon, Nov 7, 2011 at 10:16 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> With HBASE-4754 fix in place I can get further in my testing,
> but it still fails :-(
>
> Here's how it does it this time. It loads OK, but then when it
> needs to split here's what happens:
>
> 11/11/08 00:44:30 INFO handler.ServerShutdownHandler: Splitting logs for ip-10-114-225-185.ec2.internal,60020,1320726988138
> 11/11/08 00:44:30 INFO master.SplitLogManager: dead splitlog worker ip-10-114-225-185.ec2.internal,60020,1320726988138
> 11/11/08 00:44:30 INFO master.SplitLogManager: started splitting logs in [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]
> 11/11/08 00:44:31 ERROR master.HMaster: Region server ^@^@ip-10-114-225-185.ec2.internal,60020,1320726988138 reported a fatal error:
> ABORTING region server ip-10-114-225-185.ec2.internal,60020,1320726988138: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-10-114-225-185.ec2.internal,60020,1320726988138 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:222)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:148)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:750)
>         at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-08, 17:20
Roman:
> 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
> 3000ms, this is likely due to a long garbage collecting pause and it's
> usually bad, see

3000ms is the default value for hbase.regionserver.msginterval
Obviously it is too short for the validation scenario.

Can you increase its value and perform another round of test?

Thanks

On Mon, Nov 7, 2011 at 10:37 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:

> Forgot to add that from a master UI perspective here's where it is
> stuck at:
>
> $ curl http://master:60010/master-status?format=json
> [{"statustimems":-1,"status":"Waiting for distributed tasks to finish. scheduled=5 done=0 error=0","starttimems":1320731070095,"description":"Doing distributed log split in [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]","state":"RUNNING","statetimems":-1}]
>
> Regionserver finally dies and if I restart it manually the split seems to be
> finishing up as intended.
>
> Hope this helps.
>
> Thanks,
> Roman.
>
> On Mon, Nov 7, 2011 at 10:16 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> > With HBASE-4754 fix in place I can get further in my testing,
> > but it still fails :-(
> >
> > Here's how it does it this time.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-08, 17:26
On Tue, Nov 8, 2011 at 9:20 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Roman:
>> 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
>> 3000ms, this is likely due to a long garbage collecting pause and it's
>> usually bad, see
>
> 3000ms is the default value for hbase.regionserver.msginterval
> Obviously it is too short for the validation scenario.
>
> Can you increase its value and perform another round of test ?

Sure, but I have always thought 3000 was long enough for a tiny
cluster. We're not talking hundreds of nodes here. Has something
in HBase architecture changed so that this value now needs to be
bumped?

  <property>
    <name>hbase.regionserver.msginterval</name>
    <value>1000</value>
    <description>Interval between messages from the RegionServer to HMaster
    in milliseconds. Default is 15. Set this value low if you want unit
    tests to be responsive.
    </description>
  </property>

Thanks,
Roman.
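For context on the Sleeper warning being discussed: the check boils down to comparing the time actually slept with the configured period. Below is a minimal sketch of that arithmetic, assuming a fixed 10-second warning margin. The margin, the class name and the method names are assumptions made for illustration; this is not the HBase Sleeper source.

```java
class OversleepCheck {
    // Assumed margin: only warn when the overshoot is large, so that
    // ordinary scheduling jitter stays quiet.
    static final long WARN_MARGIN_MS = 10_000;

    // How much longer than the configured period the thread was away.
    static long overshootMs(long periodMs, long sleptMs) {
        return sleptMs - periodMs;
    }

    static boolean shouldWarn(long periodMs, long sleptMs) {
        return overshootMs(periodMs, sleptMs) > WARN_MARGIN_MS;
    }

    public static void main(String[] args) {
        // Numbers from the log line quoted in this thread.
        System.out.println(overshootMs(3000, 38891)); // 35891
        System.out.println(shouldWarn(3000, 38891));  // true
        // A sleep that is only a few milliseconds late is not reported.
        System.out.println(shouldWarn(3000, 3005));   // false
    }
}
```

With the figures from the log the region server was unresponsive for roughly 36 seconds beyond its 3-second period, which is what the question about the message interval turns on.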
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-08, 17:33
> 11/11/08 00:43:40 INFO regionserver.HRegion: Finished memstore flush
> of ~18.4m for region
> TestLoadAndVerify_1320729464658,<\xA1\xAF(k\xCA\x1A\xEA,1320729465485.8bd8387431feec2b09983693dfac950b.
> in 829ms, sequenceid=142942, compaction requested=false
> 11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x133817270190001, likely server has closed
> socket, closing socket connection and attempting reconnect

Is there a way to find out what could have led to the ~1min gap above ?

Also, to help narrow our search, would HBase 0.92 + hadoop 0.20.205 produce
the YouAreDeadException?

Thanks

On Tue, Nov 8, 2011 at 9:26 AM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 8, 2011 at 9:20 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > Roman:
> >> 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
> >> 3000ms, this is likely due to a long garbage collecting pause and it's
> >> usually bad, see
> >
> > 3000ms is the default value for hbase.regionserver.msginterval
> > Obviously it is too short for the validation scenario.
> >
> > Can you increase its value and perform another round of test ?
>
> Sure, but I have always thought 3000 was long enough for a tiny
> cluster. We're not talking hundreds of nodes here. Has something
> in HBase architecture changed so that this value now needs to be
> bumped?
>   <property>
>     <name>hbase.regionserver.msginterval</name>
>     <value>1000</value>
>     <description>Interval between messages from the RegionServer to HMaster
>     in milliseconds. Default is 15. Set this value low if you want unit
>     tests to be responsive.
>     </description>
>   </property>
>
> Thanks,
> Roman.
>
-
Re: HBase 0.92/Hadoop 0.22 test results
Stack 2011-11-08, 22:06
On Mon, Nov 7, 2011 at 10:16 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> 11/11/08 00:43:40 INFO regionserver.HRegion: Finished memstore flush
> of ~18.4m for region
> TestLoadAndVerify_1320729464658,<\xA1\xAF(k\xCA\x1A\xEA,1320729465485.8bd8387431feec2b09983693dfac950b.
> in 829ms, sequenceid=142942, compaction requested=false
> 11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x133817270190001, likely server has closed
> socket, closing socket connection and attempting reconnect
> 11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x133817270190004, likely server has closed
> socket, closing socket connection and attempting reconnect
> 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
> 3000ms, this is likely due to a long garbage collecting pause and it's
> usually bad, see
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9

What happened above between 00:43:40 and 00:44:31? A big old GC?
This is a standalone instance with all running in the one VM?

The YouAreDeadException happens usually when the master has figured
the RegionServer is dead before the RegionServer has figured it out.
This can happen when, say, the RS has GC paused and first thing it does
when it comes out of the pause is it heartbeats the master (meantime
it's probably running the zookeeper session expiration code
concurrently).

St.Ack
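The sequence described in this message (the master declares the server dead during a pause, then rejects the server's next heartbeat) can be sketched as a toy model. Everything below is hypothetical and heavily simplified: Master, expire(), report() and ServerIsDeadException are names made up for this illustration and do not mirror the real ServerManager or YouAreDeadException code.

```java
import java.util.HashSet;
import java.util.Set;

class DeadServerSketch {
    // Hypothetical stand-in for YouAreDeadException.
    static class ServerIsDeadException extends Exception {
        ServerIsDeadException(String message) {
            super(message);
        }
    }

    // Toy master that only remembers which servers it has declared dead.
    static class Master {
        private final Set<String> deadServers = new HashSet<>();

        // Models the master noticing that the server's session expired.
        void expire(String serverName) {
            deadServers.add(serverName);
        }

        // Models a heartbeat arriving from a region server.
        void report(String serverName) throws ServerIsDeadException {
            if (deadServers.contains(serverName)) {
                throw new ServerIsDeadException(
                    "Server REPORT rejected; currently processing "
                        + serverName + " as dead server");
            }
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        String rs = "rs-example,60020,1";
        try {
            master.report(rs); // accepted: the server is considered alive
            master.expire(rs); // server pauses; master declares it dead
            master.report(rs); // first heartbeat after the pause is rejected
        } catch (ServerIsDeadException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the model is the ordering: the server learns it is dead from the rejected heartbeat, not from its own bookkeeping.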
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-08, 22:29
On Tue, Nov 8, 2011 at 2:06 PM, Stack <[EMAIL PROTECTED]> wrote:
> What happened above between 00:43:40 and 00:44:31?

Not much judging by the logs. In fact that's part of the issue here I think.

> A big old GC?

Unlikely -- the RS had tons of Heap, but of course anything's possible.

> This is a standalone instance with all running in the on VM?

That's a small cluster running on EC2. So at the very fundamental levels
these are VMs, yes. But for all practical purposes -- it is a fully distributed
standalone set of servers.

> The YouAreDeadException happens usually when the master has figured
> the RegionServer is dead before the RegionServer has figured it out.
> This can happen when say, the RS has GC paused and first thing it does
> when it comes out of the pause is it heartbeats the master (Meantime
> its probably running the zookeeper session expiration code
> concurrently).

Right. I'll try to look into that in my testing. I also bumped the
timeout up to a minute (which I'm really nervous about, though).

Let's see...

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-09, 00:10
+Konstantin (there's something weird in append handling)

Some more updates. Hope this will help. I had this hunch that
I was seeing those weird issues when HDFS DN was at 80%
capacity (but nowhere near full!). So I quickly spun off a cluster
that had 5 DNs with modest (and unbalanced!) amount of
storage. Here's what started happening towards the end of
loading 2M records into HBase:

On the master:

{"statustimems":-1,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796207862,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},
{"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=1","starttimems":1320796206563,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
{"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=2","starttimems":1320796205304,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
{"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796203957,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]

11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=3, state=RUNNING, startTime=1320796203957, completionTime=-1 appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=2, state=RUNNING, startTime=1320796205304, completionTime=-1 appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=1, state=RUNNING, startTime=1320796206563, completionTime=-1 appears to have been leaked

And the behavior on the DNs was even weirder. I'm attaching a log
from one of the DNs. The last exception is a shocker to me:

11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 failed, returning error
java.io.IOException: Failed to open hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 for append

But perhaps it is cascading from some of the earlier ones.

Anyway, take a look at the attached log.

Now, this is a tricky issue to reproduce. Just before it started failing
again I had a completely clean run over here:
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/

Which makes me believe it is NOT configuration related.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Ted Yu 2011-11-09, 00:20
Maybe the following is related?
11/11/08 18:50:04 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: File /hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/d76a246e81525444beeea99200b3e9a4/recovered.edits/0000000000000048149 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-09, 00:27
On Tue, Nov 8, 2011 at 4:20 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Maybe the following is related?

It very well may be, but I can't explain it -- that's the trouble. I'm running close to capacity, true, but it used to work before. That's part of the reason I CCed Konstantin (the other part is that scary message about not being able to do an append).

My next try is going to be a run where no DN ever goes above 50% of storage utilization. If that cures it, fine, but it still makes for a pretty scary failure scenario.

Thanks,
Roman.

P.S. And there's also this consideration: after my Load test fails, all other HBase and Hadoop/HDFS tests seem to pass. So the cluster is definitely getting back to a stable state after that failure.
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-09, 19:44
On Tue, Nov 8, 2011 at 4:27 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> My next try is going to be to have a run where all DNs never go above 50%
> of storage utilization. If that cures it -- fine, but it still makes
> up a pretty scary failure scenario.

That was successful:
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/lastCompletedBuild/testReport/

At this point, the only question remaining is why does this behavior show up when nodes run close to capacity.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Todd Lipcon 2011-11-09, 21:38
Do you have 5*BLOCK_SIZE free space on at least one of the volumes on the DN? If these are small VMs or your dfs.data.dir is /tmp maybe 80% capacity is actually small enough that you can't allocate any more blocks?

On Wed, Nov 9, 2011 at 11:44 AM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 8, 2011 at 4:27 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
>> My next try is going to be to have a run where all DNs never go above 50%
>> of storage utilization. If that cures it -- fine, but it still makes
>> up a pretty scary failure scenario.
>
> That was successful:
> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/lastCompletedBuild/testReport/
>
> At this point, the only question remaining is why does this
> behavior show up when nodes run close to capacity.
>
> Thanks,
> Roman.

--
Todd Lipcon
Software Engineer, Cloudera
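Todd's question comes down to simple arithmetic, and a rough check can be sketched as below. This is only an illustration: the 64 MB block size is an assumption (use whatever your cluster is actually configured with), the helper names are hypothetical, and the sketch ignores reserved space and blocks that are already scheduled for a volume.

```python
import shutil

# Assumption: 64 MB blocks; substitute the block size your cluster really uses.
BLOCK_SIZE = 64 * 1024 * 1024
# The factor of 5 is taken from the question above (5*BLOCK_SIZE).
MIN_BLOCKS_FOR_WRITE = 5


def has_room_for_blocks(free_bytes, block_size=BLOCK_SIZE,
                        min_blocks=MIN_BLOCKS_FOR_WRITE):
    """Return True if free_bytes can hold at least min_blocks full blocks."""
    return free_bytes >= min_blocks * block_size


def check_volumes(paths):
    """Map each data directory (hypothetical paths) to whether it has room."""
    return {p: has_room_for_blocks(shutil.disk_usage(p).free) for p in paths}


if __name__ == "__main__":
    # A 2 GB volume that is 80% full still has roughly 400 MB free.
    print(has_room_for_blocks(int(2 * 1024**3 * 0.2)))
    # A 1 GB volume that is 80% full has roughly 200 MB free.
    print(has_room_for_blocks(int(1 * 1024**3 * 0.2)))
```

With these assumed numbers the threshold is 320 MB, so a small volume can fail the check at 80% utilization even though it is "nowhere near full" in percentage terms.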
-
Re: HBase 0.92/Hadoop 0.22 test results
Andrew Purtell 2011-11-10, 15:22
> That's a small cluster running on EC2.

What instance type?

You should use c1.xlarge or m2.4xlarge; they aren't exposed to noisy neighbors.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>________________________________
>From: Roman Shaposhnik <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, November 8, 2011 5:29 PM
>Subject: Re: HBase 0.92/Hadoop 0.22 test results
>
>On Tue, Nov 8, 2011 at 2:06 PM, Stack <[EMAIL PROTECTED]> wrote:
>> What happened above between 00:43:40 and 00:44:31?
>
>Not much judging by the logs. In fact that's part of the issue here I think.
>
>> A big old GC?
>
>Unlikely -- the RS had tons of Heap, but of course anything's possible.
>
>> This is a standalone instance with all running in the one VM?
>
>That's a small cluster running on EC2. So at the very fundamental levels
>these are VMs, yes. But for all practical purposes -- it is a fully distributed
>standalone set of servers.
>
>> The YouAreDeadException happens usually when the master has figured
>> the RegionServer is dead before the RegionServer has figured it out.
>> This can happen when say, the RS has GC paused and first thing it does
>> when it comes out of the pause is it heartbeats the master (Meantime
>> its probably running the zookeeper session expiration code
>> concurrently).
>
>Right. I'll try to look into that in my testing. I also bumped the timeout
>up to a minute (which I'm really nervous about, though). Let's see...
>
>Thanks,
>Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Andrew Purtell 2011-11-10, 15:25
Another thing that can happen: if you use spot instances, they can be taken back by AWS at any time. We had clusters in us-west-1 last week that were abruptly terminated without notice like this. (We use an on-demand master and spot slaves; only the masters remained running... several times last week...)
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-11, 00:33
On Thu, Nov 10, 2011 at 7:22 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
>> That's a small cluster running on EC2.
>
> What instance type?

m1.large

> Should use c1.xlarge or m2.4xlarge, they won't see the possibility of noisy neighbors.

I know, but these are expensive. I'm lucky that Cloudera is graciously footing the bill for m1.large instances and not requiring me to run spots.

> Another thing that can happen, is if you use spot instances your spot instances can
> be taken back by AWS at any time. We had clusters in us-west-1 last week that were
> abruptly terminated without notice like this. (We use on-demand master and spot slaves,
> only the masters remained running... several times last week...)

That's not a problem. At least for now it isn't.

Thanks,
Roman.

P.S. I was talking to EC2 AWS folks to see whether they would be in a position to donate credits for Apache projects, but these talks are not progressing well.
-
Re: HBase 0.92/Hadoop 0.22 test results
Andrew Purtell 2011-11-11, 03:53
I've seen the hypervisor steal back ~70% CPU time from m1.large for many seconds at a time, according to top.
If using EC2 for Hadoop+HBase, c1.xlarge is the minimum requirement in my experience. I've been testing HBase on EC2 for over a year.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: HBase 0.92/Hadoop 0.22 test results
Stack 2011-11-11, 16:50
On Thu, Nov 10, 2011 at 4:33 PM, Roman Shaposhnik <[EMAIL PROTECTED]> wrote:
> P.S. I was talking to EC2 AWS folks trying to see whether they would
> be in a position to donate credits for Apache Projects, but these
> talks are not progressing well.

I talked to EC2 AWS folks at HW too (They are getting it from both ends of the country!). This is supposedly a possibility for apache hbase.

You want a grant for bigtop though, right Roman?

You'd think it'd be in their interest as well as ours giving the project a few credits. You've tried writing them a pretty-please?

Good stuff,
St.Ack
-
Re: HBase 0.92/Hadoop 0.22 test results
Roman Shaposhnik 2011-11-11, 17:12
On Fri, Nov 11, 2011 at 8:50 AM, Stack <[EMAIL PROTECTED]> wrote:
> I talked to EC2 AWS folks at HW too (They are getting it from both
> ends of the country!). This is supposedly a possibility for apache
> hbase.

That's exactly what they told me -- a possibility for a "few" Apache projects. I'm yet to have a single bit of practical

> You want a grant for bigtop though, right Roman?

Well, Bigtop was conceived as THE place for integrating and validating Apache Hadoop ecosystem projects. I'm sure individual projects will benefit from having EC2 credits on a per-project basis. That said, if we want the kind of testing I did for HBase 0.92RC/Hadoop 0.22RC to happen on a regular basis for a way bigger matrix of compatibility -- we need those credits for Bigtop. Think of it as an umbrella.

I'm still at ApacheCON (it's been a long week). We've been having conversations with EMC/Greenplum about them donating CPU time on their real H/W cluster to Bigtop. The idea here is that we'd be using it for testing trunks of different Hadoop projects against each other, etc.

> You've tried writing them a pretty-please?

Yup.

Thanks,
Roman.
-
Re: HBase 0.92/Hadoop 0.22 test results
Andrew Purtell 2011-11-11, 22:56
> Well, Bigtop was conceived as THE place for integrating
> and validating Apache Hadoop ecosystem projects. I'm
> sure individual projects will benefit from having EC2 credits
> on a per-project basis. That said, if we want the kind of testing
> I did for HBase 0.92RC/Hadoop 0.22RC happen on regular
> basis for a way bigger matrix of compatibility -- we need
> those credits for Bigtop. Think of it as umbrella.

One would think some small fraction of the combined $60MM infused into the community-oriented Hadoop companies recently would buy quite a few EC2 instance-minutes for BigTop specifically.

Also, didn't someone set up a 1000 node cluster for public use recently?

   - Andy
-
Re: HBase 0.92/Hadoop 0.22 test results
Konstantin Shvachko 2011-11-15, 02:19
Guys,

In the log file attached by Roman I see the exception below, which I think is the reason for the failures. It says "could only be replicated to 0 nodes, instead of 1", which means that HDFS could not find any targets for the block. This could happen either if the disks are full, which I think Roman took care of, or if there is a spike in new block creations or in write activity generally. If there is an instantaneous increase in block creation, the load will go over 2 * avgLoad, because DataNodes do not keep up with reporting their load.

It would be interesting to turn on debug-level logging; then we should see why locations are not being chosen. It may also make sense to run the same test with
dfs.namenode.replication.considerLoad = false
Then the average load will not be taken into account. By default it is true.

11/11/08 18:50:04 WARN regionserver.SplitLogWorker: log splitting of hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 failed, returning error
java.io.IOException: File /hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/26bcb17daa237ce131b83e43ee48224c/recovered.edits/0000000000000048153 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:349)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1478)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1153)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1476)
    at org.apache.hadoop.ipc.Client.call(Client.java:1028)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
    at $Proxy7.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:84)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy7.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:975)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:847)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:447)
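The 2 * avgLoad rule Konstantin describes can be illustrated with a small sketch. This is not the NameNode's code: the function name and the per-node load numbers are hypothetical, and "load" here is just a stand-in for whatever per-DataNode activity count the NameNode last received.

```python
def eligible_targets(loads, consider_load=True):
    """Return the DataNodes that pass the load filter, sorted by name.

    loads maps a (hypothetical) DataNode name to its last reported load.
    A node is skipped when consider_load is on and its load exceeds
    twice the average load across all nodes.
    """
    if not loads:
        return []
    avg_load = sum(loads.values()) / len(loads)
    return sorted(
        name for name, load in loads.items()
        if not consider_load or load <= 2 * avg_load
    )


if __name__ == "__main__":
    # Hypothetical snapshot: one node is absorbing a burst of writes.
    reported = {"dn1": 2, "dn2": 2, "dn3": 2, "dn4": 2, "dn5": 30}
    print(eligible_targets(reported))                       # dn5 is filtered out
    print(eligible_targets(reported, consider_load=False))  # all five remain
```

The load filter alone cannot reject every node at once, since the least loaded node is never above the average. It can, however, remove exactly the nodes that still have free space, so combined with a free-space check it may leave no target at all. Turning considerLoad off, as suggested above, takes this filter out of the picture for the test.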