-
Update on my 1270 testing
Patrick Hunt 2011-11-05, 17:14

I ran the 1270-1194 patch continually overnight (trunk) in my CI env. After ~25 test runs I saw 4 failures:

1) #402 - QuorumTest.testFollowersStartAfterLeader
2) #407 - org.apache.zookeeper.test.FLETest.testLE
3) #410 - org.apache.zookeeper.test.AsyncHammerTest.testHammer
4) #415 - org.apache.zookeeper.test.AsyncHammerTest.testHammer

1) client could not connect to reestablished quorum: giving up after 30+ seconds.
2) known flaky test
3) QP failed to shutdown in 30 seconds: QuorumPeer[myid=3]0.0.0.0/0.0.0.0:11224
4) QP failed to shutdown in 30 seconds: QuorumPeer[myid=1]0.0.0.0/0.0.0.0:11222

On the plus side, no "testearlyleaderabandon" failures.

On the minus side, 3/4 are a bit worrisome. Searching back through all my previous failures I don't see this happening. Perhaps these changes have shifted some timing? My main concern is that this might be caused directly by the patch itself....

Patrick
-
Re: Update on my 1270 testing
Mahadev Konar 2011-11-05, 18:59

Thanks for the stats, Pat. 3) and 4) are a little worrisome, but we can open a JIRA against 3.4.1 and look at fixing them later. I'd think they shouldn't be a blocker for the 3.4 release. What do others think?

thanks
mahadev
-
Re: Update on my 1270 testing
Flavio Junqueira 2011-11-05, 19:01

If 2) is flaky, we need to fix it, no?

-Flavio

flavio junqueira
research scientist
[EMAIL PROTECTED]
direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301
-
Re: Update on my 1270 testing
Camille Fournier 2011-11-05, 19:15

2 has been flaky for so long, I'm not sure whether it's worth being a blocker. The AsyncHammerTests never pass for me locally. Not sure if it's a problem or not... I am tempted to go with Mahadev on this and get this 3.4 release out the door. I would be happy to help manage a 3.4.1 release soon thereafter if we find serious issues.

C
-
Re: Update on my 1270 testing
Flavio Junqueira 2011-11-05, 19:22

I'm fine with your proposal.

-Flavio
-
Re: Update on my 1270 testing
Patrick Hunt 2011-11-07, 23:23

That's fine (direction re 1-4). However, my CI branch 3.4 build failed over the w/e (once out of four runs). This is AFTER "Preparing for release 3.4.0 - take 2" was applied (so testing includes 1270, 1264, etc...)

Notice testEarlyLeaderAbandonment is failing. I have attached the log file to the ZOOKEEPER-1270 JIRA:
https://issues.apache.org/jira/secure/attachment/12502838/testEarlyLeaderAbandonment5.txt.gz

java.lang.RuntimeException: Waiting too long
    at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.waitForAll(QuorumPeerMainTest.java:324)
    at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testEarlyLeaderAbandonment(QuorumPeerMainTest.java:195)
    at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)

Should I reopen 1270, or open a new JIRA, or... ? LMK.

Note - I'm feeling quite ill so I have limited time to provide f/b & test for the next day or so.

Patrick
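Editor's note: the trace above ends in a helper throwing "Waiting too long", which is the usual shape of a poll-until-deadline check in a test. The actual `waitForAll` code is not shown in this thread, so the following is only a generic sketch of that pattern; the class name `WaitUntil`, the method `waitFor`, and its parameters are hypothetical and are not ZooKeeper APIs.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating a poll-until-deadline wait; not ZooKeeper code. */
final class WaitUntil {
    private WaitUntil() {}

    /**
     * Polls the condition every pollMs milliseconds until it holds.
     * Throws RuntimeException("Waiting too long") once timeoutMs has elapsed.
     */
    static void waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            // Compare via subtraction so the check stays correct if nanoTime wraps.
            if (System.nanoTime() - deadline >= 0) {
                throw new RuntimeException("Waiting too long");
            }
            Thread.sleep(pollMs);
        }
    }
}
```

With a helper of this shape, a timeout only says that the condition was not observed before the deadline. It does not by itself distinguish a hung quorum from a slow one, or from a test that was polling the wrong server.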
-
Re: Update on my 1270 testing
Camille Fournier 2011-11-08, 00:20

Sorry you're feeling bad, Patrick! We can take it from here.

I would really like to get some clarification on this test from some of the LE experts. What does it really mean that this test is failing? Is this the sort of failure that means that sometimes we have server startup that takes a bit longer because the leader gives up the election, or will server startup completely hang due to this? If it's the latter, it should be a high-priority fix for 3.4, but if it means that occasionally startup might have to fail and retry once, it might be worth worrying about in 3.4.1.

Thoughts?

C
-
Re: Update on my 1270 testing
Flavio Junqueira 2011-11-08, 15:37

I'm currently trying to wrap up ZOOKEEPER-1292, and I can move to early abandonment once I'm done here.

-Flavio
-
Re: Update on my 1270 testing
Camille Fournier 2011-11-08, 19:01

Anyone know why Patrick's log file might be showing a lot of this before the error?

2011-11-06 01:02:39,905 [myid:2] - INFO [Thread-76:NIOServerCnxn$StatCommand@655] - Stat command output

This test never does a stat call; it uses a ZK client to connect in. This seems strange. Perhaps the issue is a test setup one?

C
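Editor's note: for context, "stat" is one of ZooKeeper's four-letter-word commands, sent as plain text to a server's client port, which is why it can show up in a server log without any ZK client API call in the test. A minimal sketch of such a probe follows; the class `FourLetterProbe` and its `send` method are hypothetical names for illustration, and the host, port and timeout are whatever the caller supplies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical probe that sends a four-letter command such as "stat" over TCP. */
final class FourLetterProbe {
    private FourLetterProbe() {}

    /** Writes cmd to host:port, half-closes the socket, and returns everything the server sends back. */
    static String send(String host, int port, String cmd, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // Signal end of request; the server replies and then closes the connection.
            socket.shutdownOutput();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }
}
```

Any process that can reach the port can issue this command, so a "Stat command output" log line may come from a different test or monitoring job than the one being debugged.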
-
Re: Update on my 1270 testing
Camille Fournier 2011-11-08, 19:25

Btw, from the stack traces all of the servers seem to be in a healthy state, complete through leader election and following properly.

From my phone
-
Re: Update on my 1270 testing
Patrick Hunt 2011-11-08, 22:01

You're right, there is no "stat" usage in this test.

I suspect I know what this is. I just looked at that CI host and it has 2 slots. I bet that some other test (either another ZK or HBase or Flume) may have run on that same host/port at the same time my test was running. That would account for the "stat" being seen (across unit tests). It doesn't happen very often as we cycle through ports, but it's likely that's what happened here.

So my bad on this; looks like it's a false indication of test failure here. I'll see what I can do on my end to keep this from happening again.

Note that Apache Jenkins suffers from this same problem (Solaris has 2 slots). There's typically a way to limit this (a feature/plugin for Jenkins) but it doesn't look like it's available on Apache Jenkins.

Patrick
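Editor's note: the collision described above, where two jobs on the same host land on the same port, is often reduced by asking the OS for an unused port instead of cycling through a fixed range. This is a minimal sketch of that general technique, not what the ZooKeeper test suite did at the time; `FreePortPicker` and `pickFreePort` are hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical test helper; not part of the ZooKeeper test utilities. */
final class FreePortPicker {
    private FreePortPicker() {}

    /** Binds to port 0 so the OS assigns a currently unused TCP port, then releases it. */
    static int pickFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket()) {
            socket.setReuseAddress(true);
            socket.bind(new InetSocketAddress("127.0.0.1", 0));
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("client port for this test run: " + pickFreePort());
    }
}
```

There is still a short window between releasing the port and the server under test binding it, so this lowers the odds of a cross-test collision but does not remove them. Restricting the CI host to one job at a time would close the gap entirely.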