|
|
-
5 seconds pause connecting to ZK in unit tests
Jean-Daniel Cryans 2012-11-07, 18:13
Hey devs,
I'm wondering if anyone else saw this. I was running TestReplication on my machine and I saw it failing in testDeleteTypes trying to replicate _3_ rows:
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 sec <<< FAILURE! testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) Time elapsed: 0.081 sec <<< FAILURE! java.lang.AssertionError: Waited too much time for put replication at org.junit.Assert.fail(Assert.java:93) at org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242)
Digging down, I was able to jstack while the test is waiting for replication and I see:
Thread[REPL IPC Server handler 0 on 62520,5,main] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171)
It stays there the whole time. Adding debug around exists I see:
2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] zookeeper.RecoverableZooKeeper(170): Running exists on /2/master 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master
If I turn on debug for ZK I see that ZK doesn't process the exists until a few ms before the call returns.
FWIW I see other exits() hanging for about 5 seconds in the log, and it always seems to happen on the first call after the session is started.
Elliott saw the same issue with his machine yesterday but I'm wondering if anyone else saw it and/or have a clue.
Thx,
J-D
+
Jean-Daniel Cryans 2012-11-07, 18:13
-
Re: 5 seconds pause connecting to ZK in unit tests
Jean-Daniel Cryans 2012-11-07, 18:18
Or if I use the shell on my machine:
12/11/07 10:17:15 INFO zookeeper.RecoverableZooKeeper: Running exists on /hbase/master 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Socket connection established to fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, initiating session 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Session establishment request sent on fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Session establishment complete on server fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, sessionid = 0x13adc16563a0003, negotiated timeout = 40000 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x13adc16563a0003, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,36,0 request:: '/hbase/master,T response:: s{10,10,1352312182945,1352312182945,0,0,0,88625131181178880,73,0,10} 12/11/07 10:17:20 INFO zookeeper.RecoverableZooKeeper: Done with exists on /hbase/master
This sounds DNS-y...
On Wed, Nov 7, 2012 at 10:13 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote: > Hey devs, > > I'm wondering if anyone else saw this. I was running TestReplication > on my machine and I saw it failing in testDeleteTypes trying to > replicate _3_ rows: > > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 > sec <<< FAILURE! > testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) > Time elapsed: 0.081 sec <<< FAILURE! > java.lang.AssertionError: Waited too much time for put replication > at org.junit.Assert.fail(Assert.java:93) > at org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242) > > Digging down, I was able to jstack while the test is waiting for > replication and I see: > > Thread[REPL IPC Server handler 0 on 62520,5,main] > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) > at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171) > > It stays there the whole time. Adding debug around exists I see: > > 2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] > zookeeper.RecoverableZooKeeper(170): Running exists on /2/master > 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] > zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master > > If I turn on debug for ZK I see that ZK doesn't process the exists > until a few ms before the call returns. > > FWIW I see other exits() hanging for about 5 seconds in the log, and > it always seems to happen on the first call after the session is > started. > > Elliott saw the same issue with his machine yesterday but I'm > wondering if anyone else saw it and/or have a clue. > > Thx, > > J-D
+
Jean-Daniel Cryans 2012-11-07, 18:18
-
Re: 5 seconds pause connecting to ZK in unit tests
Kevin O'dell 2012-11-07, 18:22
Can you disable IPv6 on that machine and see if it will connect over IPv4 or set the java setting(don't remember off the top of my head) to prefer IPv4?
On Wed, Nov 7, 2012 at 1:18 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote:
> Or if I use the shell on my machine: > > 12/11/07 10:17:15 INFO zookeeper.RecoverableZooKeeper: Running exists > on /hbase/master > 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Socket connection > established to fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, > initiating session > 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Session establishment > request sent on fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181 > 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Session establishment > complete on server fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, > sessionid = 0x13adc16563a0003, negotiated timeout = 40000 > 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Reading reply > sessionid:0x13adc16563a0003, packet:: clientPath:null serverPath:null > finished:false header:: 1,3 replyHeader:: 1,36,0 request:: > '/hbase/master,T response:: > s{10,10,1352312182945,1352312182945,0,0,0,88625131181178880,73,0,10} > 12/11/07 10:17:20 INFO zookeeper.RecoverableZooKeeper: Done with > exists on /hbase/master > > This sounds DNS-y... > > On Wed, Nov 7, 2012 at 10:13 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> > wrote: > > Hey devs, > > > > I'm wondering if anyone else saw this. I was running TestReplication > > on my machine and I saw it failing in testDeleteTypes trying to > > replicate _3_ rows: > > > > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 > > sec <<< FAILURE! > > testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) > > Time elapsed: 0.081 sec <<< FAILURE! > > java.lang.AssertionError: Waited too much time for put replication > > at org.junit.Assert.fail(Assert.java:93) > > at > org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242) > > > > Digging down, I was able to jstack while the test is waiting for > > replication and I see: > > > > Thread[REPL IPC Server handler 0 on 62520,5,main] > > at java.lang.Object.wait(Native Method) > > at java.lang.Object.wait(Object.java:485) > > at > org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) > > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) > > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171) > > > > It stays there the whole time. Adding debug around exists I see: > > > > 2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] > > zookeeper.RecoverableZooKeeper(170): Running exists on /2/master > > 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] > > zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master > > > > If I turn on debug for ZK I see that ZK doesn't process the exists > > until a few ms before the call returns. > > > > FWIW I see other exits() hanging for about 5 seconds in the log, and > > it always seems to happen on the first call after the session is > > started. > > > > Elliott saw the same issue with his machine yesterday but I'm > > wondering if anyone else saw it and/or have a clue. > > > > Thx, > > > > J-D >
-- Kevin O'Dell Customer Operations Engineer, Cloudera
+
Kevin O'dell 2012-11-07, 18:22
-
Re: 5 seconds pause connecting to ZK in unit tests
Jean-Daniel Cryans 2012-11-07, 18:28
I was reading that it's impossible to do in OSX but totally forgot about -Djava.net.preferIPv4Stack=true. And it does fix the issue.
I wonder how much this is affecting users.
J-D
On Wed, Nov 7, 2012 at 10:22 AM, Kevin O'dell <[EMAIL PROTECTED]> wrote: > Can you disable IPv6 on that machine and see if it will connect over IPv4 > or set the java setting(don't remember off the top of my head) to prefer > IPv4? > > On Wed, Nov 7, 2012 at 1:18 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > >> Or if I use the shell on my machine: >> >> 12/11/07 10:17:15 INFO zookeeper.RecoverableZooKeeper: Running exists >> on /hbase/master >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Socket connection >> established to fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, >> initiating session >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Session establishment >> request sent on fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181 >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Session establishment >> complete on server fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, >> sessionid = 0x13adc16563a0003, negotiated timeout = 40000 >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Reading reply >> sessionid:0x13adc16563a0003, packet:: clientPath:null serverPath:null >> finished:false header:: 1,3 replyHeader:: 1,36,0 request:: >> '/hbase/master,T response:: >> s{10,10,1352312182945,1352312182945,0,0,0,88625131181178880,73,0,10} >> 12/11/07 10:17:20 INFO zookeeper.RecoverableZooKeeper: Done with >> exists on /hbase/master >> >> This sounds DNS-y... >> >> On Wed, Nov 7, 2012 at 10:13 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> >> wrote: >> > Hey devs, >> > >> > I'm wondering if anyone else saw this. I was running TestReplication >> > on my machine and I saw it failing in testDeleteTypes trying to >> > replicate _3_ rows: >> > >> > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 >> > sec <<< FAILURE! >> > testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) >> > Time elapsed: 0.081 sec <<< FAILURE! >> > java.lang.AssertionError: Waited too much time for put replication >> > at org.junit.Assert.fail(Assert.java:93) >> > at >> org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242) >> > >> > Digging down, I was able to jstack while the test is waiting for >> > replication and I see: >> > >> > Thread[REPL IPC Server handler 0 on 62520,5,main] >> > at java.lang.Object.wait(Native Method) >> > at java.lang.Object.wait(Object.java:485) >> > at >> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) >> > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) >> > at >> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171) >> > >> > It stays there the whole time. Adding debug around exists I see: >> > >> > 2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] >> > zookeeper.RecoverableZooKeeper(170): Running exists on /2/master >> > 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] >> > zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master >> > >> > If I turn on debug for ZK I see that ZK doesn't process the exists >> > until a few ms before the call returns. >> > >> > FWIW I see other exits() hanging for about 5 seconds in the log, and >> > it always seems to happen on the first call after the session is >> > started. >> > >> > Elliott saw the same issue with his machine yesterday but I'm >> > wondering if anyone else saw it and/or have a clue. >> > >> > Thx, >> > >> > J-D >> > > > > -- > Kevin O'Dell > Customer Operations Engineer, Cloudera
+
Jean-Daniel Cryans 2012-11-07, 18:28
-
Re: 5 seconds pause connecting to ZK in unit tests
Andrew Purtell 2012-11-07, 18:33
Or Hudsons/Jenkins running the HBase test suite. On Wed, Nov 7, 2012 at 10:28 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote:
> I was reading that it's impossible to do in OSX but totally forgot > about -Djava.net.preferIPv4Stack=true. And it does fix the issue. > > I wonder how much this is affecting users. > > J-D > > On Wed, Nov 7, 2012 at 10:22 AM, Kevin O'dell <[EMAIL PROTECTED]> > wrote: > > Can you disable IPv6 on that machine and see if it will connect over IPv4 > > or set the java setting(don't remember off the top of my head) to prefer > > IPv4? > > > > On Wed, Nov 7, 2012 at 1:18 PM, Jean-Daniel Cryans <[EMAIL PROTECTED] > >wrote: > > > >> Or if I use the shell on my machine: > >> > >> 12/11/07 10:17:15 INFO zookeeper.RecoverableZooKeeper: Running exists > >> on /hbase/master > >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Socket connection > >> established to fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, > >> initiating session > >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Session establishment > >> request sent on fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181 > >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Session establishment > >> complete on server fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, > >> sessionid = 0x13adc16563a0003, negotiated timeout = 40000 > >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Reading reply > >> sessionid:0x13adc16563a0003, packet:: clientPath:null serverPath:null > >> finished:false header:: 1,3 replyHeader:: 1,36,0 request:: > >> '/hbase/master,T response:: > >> s{10,10,1352312182945,1352312182945,0,0,0,88625131181178880,73,0,10} > >> 12/11/07 10:17:20 INFO zookeeper.RecoverableZooKeeper: Done with > >> exists on /hbase/master > >> > >> This sounds DNS-y... > >> > >> On Wed, Nov 7, 2012 at 10:13 AM, Jean-Daniel Cryans < > [EMAIL PROTECTED]> > >> wrote: > >> > Hey devs, > >> > > >> > I'm wondering if anyone else saw this. I was running TestReplication > >> > on my machine and I saw it failing in testDeleteTypes trying to > >> > replicate _3_ rows: > >> > > >> > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 > >> > sec <<< FAILURE! > >> > testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) > >> > Time elapsed: 0.081 sec <<< FAILURE! > >> > java.lang.AssertionError: Waited too much time for put replication > >> > at org.junit.Assert.fail(Assert.java:93) > >> > at > >> > org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242) > >> > > >> > Digging down, I was able to jstack while the test is waiting for > >> > replication and I see: > >> > > >> > Thread[REPL IPC Server handler 0 on 62520,5,main] > >> > at java.lang.Object.wait(Native Method) > >> > at java.lang.Object.wait(Object.java:485) > >> > at > >> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) > >> > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) > >> > at > >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171) > >> > > >> > It stays there the whole time. Adding debug around exists I see: > >> > > >> > 2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] > >> > zookeeper.RecoverableZooKeeper(170): Running exists on /2/master > >> > 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] > >> > zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master > >> > > >> > If I turn on debug for ZK I see that ZK doesn't process the exists > >> > until a few ms before the call returns. > >> > > >> > FWIW I see other exits() hanging for about 5 seconds in the log, and > >> > it always seems to happen on the first call after the session is > >> > started. > >> > > >> > Elliott saw the same issue with his machine yesterday but I'm > >> > wondering if anyone else saw it and/or have a clue. > >> > > >> > Thx, > >> > > >> > J-D > >> > > > > > > > > -- > > Kevin O'Dell
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
+
Andrew Purtell 2012-11-07, 18:33
-
Re: 5 seconds pause connecting to ZK in unit tests
Jean-Daniel Cryans 2012-11-07, 18:53
Yeah we could do something like this in the pom:
<argLine>-enableassertions -Xmx1900m - -Djava.security.egd=file:/dev/./urandom</argLine> + -Djava.security.egd=file:/dev/./urandom + -Djava.net.preferIPv4Stack=true</argLine>
As long as we don't test something related to IPV6 :)
On Wed, Nov 7, 2012 at 10:33 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote: > Or Hudsons/Jenkins running the HBase test suite. > > > On Wed, Nov 7, 2012 at 10:28 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > >> I was reading that it's impossible to do in OSX but totally forgot >> about -Djava.net.preferIPv4Stack=true. And it does fix the issue. >> >> I wonder how much this is affecting users. >> >> J-D >> >> On Wed, Nov 7, 2012 at 10:22 AM, Kevin O'dell <[EMAIL PROTECTED]> >> wrote: >> > Can you disable IPv6 on that machine and see if it will connect over IPv4 >> > or set the java setting(don't remember off the top of my head) to prefer >> > IPv4? >> > >> > On Wed, Nov 7, 2012 at 1:18 PM, Jean-Daniel Cryans <[EMAIL PROTECTED] >> >wrote: >> > >> >> Or if I use the shell on my machine: >> >> >> >> 12/11/07 10:17:15 INFO zookeeper.RecoverableZooKeeper: Running exists >> >> on /hbase/master >> >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Socket connection >> >> established to fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, >> >> initiating session >> >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Session establishment >> >> request sent on fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181 >> >> 12/11/07 10:17:20 INFO zookeeper.ClientCnxn: Session establishment >> >> complete on server fe80:0:0:0:0:0:0:1%1/fe80:0:0:0:0:0:0:1%1:2181, >> >> sessionid = 0x13adc16563a0003, negotiated timeout = 40000 >> >> 12/11/07 10:17:20 DEBUG zookeeper.ClientCnxn: Reading reply >> >> sessionid:0x13adc16563a0003, packet:: clientPath:null serverPath:null >> >> finished:false header:: 1,3 replyHeader:: 1,36,0 request:: >> >> '/hbase/master,T response:: >> >> s{10,10,1352312182945,1352312182945,0,0,0,88625131181178880,73,0,10} >> >> 12/11/07 10:17:20 INFO zookeeper.RecoverableZooKeeper: Done with >> >> exists on /hbase/master >> >> >> >> This sounds DNS-y... >> >> >> >> On Wed, Nov 7, 2012 at 10:13 AM, Jean-Daniel Cryans < >> [EMAIL PROTECTED]> >> >> wrote: >> >> > Hey devs, >> >> > >> >> > I'm wondering if anyone else saw this. I was running TestReplication >> >> > on my machine and I saw it failing in testDeleteTypes trying to >> >> > replicate _3_ rows: >> >> > >> >> > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 59.62 >> >> > sec <<< FAILURE! >> >> > testDeleteTypes(org.apache.hadoop.hbase.replication.TestReplication) >> >> > Time elapsed: 0.081 sec <<< FAILURE! >> >> > java.lang.AssertionError: Waited too much time for put replication >> >> > at org.junit.Assert.fail(Assert.java:93) >> >> > at >> >> >> org.apache.hadoop.hbase.replication.TestReplication.testDeleteTypes(TestReplication.java:242) >> >> > >> >> > Digging down, I was able to jstack while the test is waiting for >> >> > replication and I see: >> >> > >> >> > Thread[REPL IPC Server handler 0 on 62520,5,main] >> >> > at java.lang.Object.wait(Native Method) >> >> > at java.lang.Object.wait(Object.java:485) >> >> > at >> >> org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1259) >> >> > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:998) >> >> > at >> >> >> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:171) >> >> > >> >> > It stays there the whole time. Adding debug around exists I see: >> >> > >> >> > 2012-11-07 10:01:15,759 INFO [REPL IPC Server handler 0 on 62520] >> >> > zookeeper.RecoverableZooKeeper(170): Running exists on /2/master >> >> > 2012-11-07 10:01:20,244 INFO [REPL IPC Server handler 0 on 62520] >> >> > zookeeper.RecoverableZooKeeper(184): Done with exists on /2/master >> >> > >> >> > If I turn on debug for ZK I see that ZK doesn't process the exists
+
Jean-Daniel Cryans 2012-11-07, 18:53
|
|