Below is a state of hbase 0.95/trunk unit tests (Includes a little taxonomy
of test failure type definitions).
On Andrew's ec2 build box, 0.95 is passing most of the time:
It is not as good on Apache build box but it is getting better:
On Apache, I have seen loads up in the 500s and all file descriptors used
according to the little resources report printed at the end of each test.
If these numbers are to be believed (TBD), we may never achieve 100% pass
rate on Apache builds.
Andrew's ec2 builds run the integration tests too where the apache builds
do not -- sometimes we'll fail an integration test run which makes the
Andrew ec2 red/green ratio look worse that it actually is.
Trunk builds lag. They are being worked on.
We seem to be over the worst of the flakey unit tests. We have a few
stragglers still but they are being hunted down by the likes of the
merciless Jimmy Xiang and Jeffrey Zhong.
The "zombies" have been mostly nailed too (where "zombies" are tests that
refuse to die continuing after the suite has completed causing the build to
fail). The zombie trap from test-patch.sh was ported over to apache and
ec2 build and it caught the last of undying.
We are now into a new phase where "all" tests pass but the build still
fails. Here is an example:
only clue I have to go on is the fact that when we fail, the number of
tests run is less than the total that shows for a successful run.
Unless anyone has a better idea, to figure why the hang, I compare the list
of tests that show in a good run vs. those of a bad run. Tests that are in
the good run but missing from the bad run are deemed suspect. In the
absence of other evidence or other ideas, I am blaming these "invisibles"
for the build fail.
Here is an example:
This is a good 0.95 hadoop2 run (notice how we are running integration
tests tooooo and they succeed!! On hadoop2!!!!):
In hbase-server module:
Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
This is a bad run:
Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
If I compare tests, the successful run has:
> Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
... where the bad run does not show the above test.
TestHLogSplitCompressed has 34 tests one of which is disabled so that
would seem to account for the discrepancy.
I've started to disable tests that fail likes this putting them aside for
original authors or the interested to take a look to see why they fail
occasionally. I put them aside so we can enjoy passing builds in the
meantime. I've already moved aside or disabled a few tests and test
TestHCM.testDeleteForZKConnLeak was disabled
... and a few others.
Finally (if you are still reading), I would suggest that test failures in
hadoopqa are now more worthy of investigation. Illustrative is what
happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections"
where the patch had +1s and on its first run, a unit test failed (though it
passed locally). The second run obscured the first run's failure. After
digging by another, the patch had actually broken the first test (though it
looked unrelated). I would suggest that now tests are healthier, test
failures are worth paying more attention too.